Write string.h functions using string instructions in asm x86-64
Pangram verdict · v3.3
We believe that this document is fully human-written
AI likelihood · overall
HumanArticle text · 1,800 words · 7 segments analyzed
IntroductionThe C standard library offers a bunch of functions (whose declarations can be found in the string.h header) to manage NULL-terminated strings and arrays. These are some of the most used C functions, often implemented as builtin by the C compiler, as they are crucial to the speed of programs.On the other hand, the x86 architecture contains “string instructions”, aimed at implementing operations on strings at the hardware level. Moreover, the x86 architecture was incrementally enhanced with SIMD instructions over the years, allowing for the processing of multiple bytes of data in a single instruction.In this article, we’ll inspect the implementation of string.h of the GNU standard library for x86, and see how it compares with a pure assembly implementation of these functions using string instructions and SIMD, and try to explain the choices made by the GNU developers and help you write better assembly.Disassembling a call to memcpyOne of the most popular C functions is memcpy. It copies an array of bytes to another, which is a very common operation and makes its performance particularly important.There are several ways you can perform this operation using x86 asm. Let’s see how it is implemented by gcc using this simple C program:#include <string.h> #define BUF_LEN 1024 char a[BUF_LEN]; char b[BUF_LEN]; int main(void) { memcpy(b, a, BUF_LEN); return EXIT_SUCCESS; }We can observe the generated asm by using godbolt.Or compile the code using gcc 14.2: gcc -O1 -g -o string main.cAnd then disassemble the executable using:objdump --source-comment="; " --disassembler-color=extended --disassembler-options=intel --no-show-raw-insn --disassemble=main stringYou should get this result:0000000000401134 <main>: ; ; int main(int argc, char
*argv[]) { ; memcpy(b, a, BUF_LEN); 401134: mov esi,0x404440 401139: mov edi,0x404040 40113e: mov ecx,0x80 401143: rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi] ; return 0; ; } 401146: mov eax,0x0 40114b: retThe first surprising thing you notice is that the machine code does not contain any call to the memcpy function. It has been replaced by 3 mov instructions preceding a mysterious rep movsq instruction.rep movsq is one of the five string instructions defined in the “Intel® 64 and IA-32 Architectures Software Developer’s Manual - Volume 1: Basic Architecture 5.1.8”.So it is time to learn more about these string instructions.The string instructions of x86String instructions perform operations on array elements pointed by rsi (source register) and rdi (destination register).instructionDescriptionEffect on registersmovsMove string*(rdi++) = *(rsi++)cmpsCompare stringcmp *(rsi++), *(rdi++)scasScan stringcmp rax, *(rdi++)lodsLoad stringrax = *(rsi++)stosStore string*(rdi++) = raxEach of these instructions must have a suffix (b,w,d,q) indicating the type of elements pointed by rdi and rsi (byte, word, doubleword, quadword).These instructions may also have a prefix indicating how to repeat themselves.prefixDescriptionEffect on registersrepRepeat while the ECX register not zerofor(; rcx != 0; rcx–)repe/repzRepeat while the ECX register not zero and the ZF flag is setfor(; rcx != 0 && ZF == true; rcx–)repne/repnzRepeat while the ECX register not zero and the ZF flag is clearfor(; rcx !
= 0 && ZF == false; rcx–)The repe/repz and repne/repnz prefixes are used only with the cmps and scas instructions (as they are the only ones modifying the RFLAGS register).The movs instructionNow that we have learned more about the string instructions, we can break down the effect of the rep movsq instruction:Copy the quadword pointed by rsi to rdiAdd 8 to rsi and rdi so that they point onto the next quadwordDecrement rcx and repeat until rcx == 0This is what we would expect memcpy to do, except for one thing: bytes are not copied one by one, but in blocks of 8. Here, as the byte size of our arrays is a multiple of 8, we can copy the source array as an array of quadwords. This will necessitate 8 times fewer operations than copying the array one byte at a time.Let’s change the size of the arrays to 1023 to see how the compiler will react when the array size is not a multiple of 8 anymore:0000000000401134 <main>: ; ; int main(int argc, char *argv[]) { ; memcpy(b, a, BUF_LEN); 401134: mov esi,0x404440 401139: mov edi,0x404040 40113e: mov ecx,0x7f 401143: rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi] 401146: mov eax,DWORD PTR [rsi] 401148: mov DWORD PTR [rdi],eax 40114a: movzx eax,WORD PTR [rip+0x36eb] # 40483c <a+0x3fc> 401151: mov WORD PTR [rip+0x32e4],ax # 40443c <b+0x3fc> 401158: movzx eax,BYTE PTR [rip+0x36df]
# 40483e <a+0x3fe> 40115f: mov BYTE PTR [rip+0x32d9],al # 40443e <b+0x3fe> ; return 0; ; } 401165: mov eax,0x0 40116a: retInstead of replacing the rep movsq by the rep movsb instruction, gcc preferred to stop the repetition of the movsq instruction 8 bytes earlier and add mov instructions to copy a doubleword, a word, and a byte.The cmps instructionThe cmps instruction will compare the elements pointed by rsi and rdi and will set the flag accordingly. As cmps will set the ZF flag, we can use the repe/repz and repne/repnz prefixes to, respectively, continue until the strings differ or stop when matching characters are encountered.Let’s write a basic memcmp function using this instruction:; int memcmp_cmpsb(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n); memcmp: mov rcx, rdx ; rcx = n xor eax, eax ; Set return value to zero xor edx, edx ; rdx = 0 repe cmpsb ; for(; rcx != 0 and ZF == true; rcx--) ; cmp *(rsi++), *(rdi++) setb al ; if(ZF == false and CF == true) al = 1 seta dl ; if(ZF == false and CF == false) bl = 1 sub eax, edx ; return al - dl .exit retWe use the repe cmpsb instruction to iterate over the strings s1 and s2 until two bytes differ.When we exit the repe cmpsb instruction, the RFLAGS register is set according to the last byte comparison. We can then use the set[cc] instructions to set bytes al and dl according to the result comparison.In the same way, the memcpy function copies groups of 8 bytes; we can use the repe cmpsq instruction to compare bytes in groups of 8 (or cmpsd for groups of 4 bytes on 32-bit architectures).;
int memcmp_cmpsq_unaligned(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n); memcmp_cmpsq_unaligned: lea rcx, [rdx + 0x7] ; rcx = n and rdx, (8 - 1) ; rdx = n % 8 shr rcx, 3 ; rcx = n / 8 xor eax, eax ; rax = 0 repe cmpsq ; for(; rcx != 0 and ZF == true; rcx += 8) ; cmp *(rsi++), *(rdi++) je .exit ; If no difference was found return mov r8, [rdi - 0x8] ; Read the last (unaligned) quadword of s1 mov r9, [rsi - 0x8] ; Read the last (unaligned) quadword of s2 test rcx, rcx ; if(rcx != 0) jnz .cmp ; goto .cmp shl rdx, 3 ; rdx = 8 * (8 - n % 8) jz .cmp ; if(rdx == 0) goto .cmp bzhi r8, r8, rdx ; r8 <<= 8 * (8 - n % 8) bzhi r9, r9, rdx ; r9 <<= 8 * (8 - n % 8) .cmp: bswap r8 ; Convert r8 to big-endian for lexical comparison bswap r9 ; Convert r9 to big-endian for lexical comparison cmp r8, r9 ; Lexical comparison of quadwords seta al ; if (r8 > r9) al = 1 setb cl ; if (r8 < r9) cl = 1 sub eax, ecx ; return eax - ecx .exit: retTo get the result of the comparison, we need to compare the last two quadwords. However, on little-endian systems, the lowest significant byte will be the first one, and we want to compare the byte in lexical order. Hence, the need to convert the quadword to big-endian using the bswap instruction.
Zero High Bits Starting with Specified Bit PositionThe instruction bzhi is useful when you need to mask out the higher bits of a register. Here, when comparing the last quadword, we need to erase all bits in r8 and r9 that aren’t “valid” (i.e. which are not part of the input arrays).You can find documentation about this instruction here.WarningThis function should only be used for blocks of memory of size multiples of 8 with 8-byte alignment.For production use, refer to the Benchmarking section.The scas instructionThe scas instruction will compare the content of rax with the element pointed by rdi and set the flag accordingly. We can use it in a similar way to what we did for cmps, taking advantage of the repe/repz and repne/repnz prefixes.Let’s write a simple strlen function using the scasb instruction:; size_t strlen(rdi: const char *s) strlen: mov rcx, -1 repnz scasb ; for(; rcx != 0 and ZF == false; rcx--) ; cmp rax, *(rdi++) not rcx ; before this insn rcx = - (len(rdi) + 2) dec rcx ; after this insn rcx = ~(- (len(rdi) + 2)) - 1 ; = -(- (len(rdi) + 2)) - 1 - 1 ; = len(rdi) xchg rax, rcx ; rax = len(rdi) retTipThe instruction sequence used here to calculate the length of a text string is a well-known technique that dates back to the 8086 CPU.WarningDon’t use this function for production code, as it can only compare bytes one by one.For production, always prefer a loop alternative to compare groups of bytes using the largest registers (see the Benchmarking section).The lods instructionThe lods instruction will load to the rax register the element pointed to by rsi and increment rsi to point to the next element.
As this instruction does nothing but set register, it is never used with a prefix (the value would be overwritten for each repetition).It can, however, be used to examine a string, for instance, to find a character:; char* strchr_lodsb(rdi: const char* s, rsi: int c) strchr_lodsb: xchg rdi, rsi ; rdi = c, rsi = s .loop: lodsb ; al = *(rsi++) cmp dil, al ; if(c == al) je .end ; goto .end test al, al ; if(al != 0) jnz .loop ; goto .loop xor rax, rax ; return 0 ret .end: lea rax, [rsi - 1] ; return rsi - 1 retWarningFor production code, always prefer to load data using the largest registers (see the Benchmarking section).The stos instructionThe stos instruction will write the content of the rax register to the element pointed by rdi and increment rdi to point to the next element.Note that according to the “Intel® 64 and IA-32 Architectures Software Developer’s Manual - Volume 1: Basic Architecture 7.3.9.2”:a REP STOS instruction is the fastest way to initialize a large block of memory.Actually, this is the way gcc will implement a memset when it knows the size and alignment of the string:#include <string.h> #define BUF_LEN 1024 char a[BUF_LEN]; int main(int argc, char *argv[]) { memset(a, 1, BUF_LEN); return 0; }000000000040115a <main>: ; ; int main(int argc, char *argv[]) { ; memset(a, 1, BUF_LEN); 40115a: mov edx,0x404460 40115f: mov ecx,0x80 401164: movabs rax,0x101010101010101 40116e: mov rdi,rdx 401171: rep stos