x86 Assembly 101: multiplications, repetitions and switches

In the last post I wrote about how the SHL and the SHR instructions can be used to multiply and divide by powers of 2. What if that’s not the case? What if a multiplication by a different number is required? Well, here we go…

The MUL/IMUL instructions

The MUL and IMUL (Integer MULtiplication) instructions are used to multiply two signed (with IMUL) or unsigned (with MUL) numbers. The syntaxes are the following

MUL/IMUL [r/m32]

IMUL [reg], [r/m32]

IMUL [reg], [r/m32], [immediate]

So, IMUL can have different forms, but the type of operands to be used are strictly fixed, while MUL has only one form.

Using the first form means to multiply the EAX register by the value specified as operand and to store the result in EDX:EAX (yes, two 32-bit registers are needed to store the result of a multiplication of two 32-bit numbers).

MUL/IMUL [EBX]

In the second form, the register specified as first operand is multiplied with the second operand and it’s use to store the result.

IMUL EAX, [EBX]

The third form, the one with three operands, is the most complete one: in fact the second operand is multiplied by the immediate specified as third operand. The result will be stored in the register specified as first operand.

IMUL ESI, EDI, 25

The DIV/IDIV instructions

In a similar way, the DIV/IDIV instructions are used to divide two signed (with IDIV) or unsigned (with DIV) numbers. Unlike the MUL/IMUL instructions, DIV/IDIV have only two forms:

DIV/IDIV [r/m8]

DIV/IDIV [r/m32]

If an 8-bit operand is specified, the first form is used. The AX register is divided by the operand: AL will store the quotient and AH the remainder.

If a 32-bit operand is specified, then the 64-bit number stored in EDX:EAX will be divided by the specified operand: the quotient will be stored into EAX, while the remained will go into EDX.

In both cases, if the operand is 0, a divide-by-zero exception will be raised.


To show those two (four ?) instructions, I just need to slightly modify the piece of code I used to show the use of SHL/SHR:

int main(void)
{
    unsigned int a;

    a = 0xc0de;
    a /= 5;
    a *= 9;

    return a;
}

The code generated by the compiler is:

_main:
00401000  push        ebp
00401001  mov         ebp,esp
00401003  push        ecx
00401004  mov         dword ptr [ebp-4],0C0DEh
0040100B  mov         eax,dword ptr [ebp-4]
0040100E  xor         edx,edx
00401010  mov         ecx,5
00401015  div         eax,ecx
00401017  mov         dword ptr [ebp-4],eax
0040101A  imul        edx,dword ptr [ebp-4],9
0040101E  mov         dword ptr [ebp-4],edx
00401021  mov         eax,dword ptr [ebp-4]
00401024  mov         esp,ebp
00401026  pop         ebp
00401027  ret

It is interesting to see how the EDX register has been cleaned at 0x0040100E before performing the division and the use of the three-operands form of the IMUL instruction. Weird anyway how VS chose to use the IMUL instruction even if we’re dealing with unsigned numbers. Thanks to stackoverflow.com, anyway, I pretty found an explanation: the IMUL instruction is definitely more powerful, as it allows to use different registers than EAX.


A category of instruction that has not been covered yet is the powerful one that takes care of the string operations. As far as I’ve seen around, the most used instructions of this category are the MOVS, the LODS and the STOS. Anyway, all the instructions that belong to this category stronly rely on one of the flags: the direction flag (DF).

The MOVS/MOVSB/MOVSW/MOVSD instruction

This one works more or less like a MOV, but the source and destination here are more strict. In fact, MOVS copies a byte or a word or a dword from the location pointed by [ESI] into the one pointed by [EDI]. After this movement, if the DF flag is clear, then ESI and EDI are incremented, otherwise they’re both decremented. The syntax for the MOVS instruction is the following:

MOVS [Destination], [Source]

Wait a moment. Weren’t the source and the destination of the transfer already specified in the ESI and EDI registers? Why should I even specify them again here? This information is used to know if the data to be copied will be made of bytes or words: the operands should be symbols that indicate the size and the location of both source and destination. Why that “should” in italic? Because they don’t necessarily have to indicate the correct location, but only the size, as the location is always taken from ESI and EDI.

Why the size of the data to copy is important? Because the assembler ALWAYS translates the MOVS instruction into MOVSB (in case of byte transfer) or MOVSW (in case of word transfer) or MOVSD (in case of dword transfer). MOVSB, MOVSW and MOVSD have no operands, so the syntax is:

MOVSB

MOVSW

MOVSD

An example of usage of the MOVS instruction is:

MOV ESI, DWORD PTR [EBP]
MOV EDI, DWORD PTR [EBP + 0x4h]
MOVSB

The REP prefix

Wait, this REP stuff was not in the list I just wrote. I know, but this is not even an instruction, but a prefix. What makes MOVS, LODS, STOS and all the instruction is that the REP prefix can be used with them. When REP is written before the instruction, that instruction is execute ECX times: at each execution the register is decreased and, when ECX == 0, the next instruction is executed. The syntax is just the following

REP MOVS ……

It can actually be used with the other instructions as well, but for now let’s focus on this one. The result is to copy ECX bytes/words from one location to another. Man, that’s exactly what memcpy does! So, I decided to write a simple program to check if the VS implementation of memcpy uses the REP MOVS instruction.


The code generated by VS for memcpy is pretty complex, as he tries to optimize the transfer in the best way possible. Anyway, it can be summed up with the following (slightly modified) piece of code taken from the “Reversing: Secrets of Reverse Engineering” book (which is actually found in the VS implementation):

7C924E52  mov         esi,dword ptr [ebp+0Ch]            ; Set the source
7C924E55  mov         ecx,dword ptr [ebp+10h]            ; Set the size
7C924E58  mov         eax,ecx                            ; Backup the size
7C924E5A  mov         edi,dword ptr [ebp+8]              ; Set the destination
7C924E5D  shr         ecx,2                              ; Divide the size by 4
                                                         ; This is done to use MOVSD
                                                         ; instead of MOVSB, in order
                                                         ; to speed up the transfer
7C924E60  rep movs    dword ptr es:[edi],dword ptr [esi] ; Transfer the words
7C924E62  mov         ecx, eax                           ; Move the original size
                                                         ; back to ECX
7C924E64  and         ecx,3                              ; Check if there still some
                                                         ; to be copied
7C924E67  rep movs    byte ptr es:[edi],byte ptr [esi]   ; Copy the remaining bytes

Well, that is pretty awesome, isn’t it?


The STOS/STOSB/STOSW/STOSD instruction

The STOS instruction copies the content of AL (in case of byte transfer), AX (in case of word transfer) or EAX (in case of dword transfer) to [EDI]. Again, after the copy, if the DF flag is clear, then EDI is incremented, otherwise it’s decremented. In the same way of MOVS, the syntax is:

STOS [Source]

where [Source] is used to know the size of the data, in order to transform this instruction into STOSB, STOSW or STOSD accordingly.

An example of usage is:

MOV AL, 0
MOV EDI, DWORD PTR [EBP]
STOS

While the MOVS instruction is just like memcpy when joins the REP prefix, the STOS instruction is just like memset. In the (complicated) implementation of memset used by VS I can confirm that REP STOS is used (along with other unknown-for-now) instructions.

The LODS/LODSB/LODSW/LODSD instruction

LODS transfers the value (byte, word or dword) located at [ESI] into AL (byte), AX (word) or EAX (dword). After this, if DF is clear, then ESI is incremented, otherwise it’s decremented. The syntax is:

LODS [Destination]

where [Destination] is used to know the size of the data, in order to transform this instruction into LODSB, LODSW or LODSD accordingly.

An example of this could be:

LEA ESI, EAX
LODSD
MOV EBX, EAX

I think it’s uncommon to find the LODS instruction along with the REP prefix, but for sure there are some examples of it somewhere. After all, a “REP LODS” is something like the following:

LODSB == LEA ESI, [ECX + ESI]

LODSW == LEA ESI, [ECX*2 + ESI]

LODSD == LEA ESI, [ECX*4 + ESI]

Some “new” 80186 instructions: PUSHA/POPA, ENTER/LEAVE

With the introduction of the 80186 Intel processor, some new x86 instructions were added in order to speed-up some common processes, such as pushing/popping the registers onto/from the stack and creating/destroying a stack frame.

The PUSHA instruction, in fact, pushes all the general purpose register onto the stack in a specific order: EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI. Worth of note is that the ESP register is pushed as well: the value is the one, of course, that the register had before the instruction was executed. The opposite task is performed by the POPA instruction. Both of those instructions didn’t need any parameter, so, the syntax is the following:

PUSHA

POPA

Anyway, the topic I wanted to face the most is the one about the ENTER/LEAVE instructions. Those awesome instructions, in fact, respectively create and destroy a new stack frame. Let’s focus on the ENTER instruction before. How was the stack frame created before the introduction of this instruction? Well, in a simply fashioned way:

push        ebp
mov         ebp,esp
sub         esp, XXXX

where XXXX is the number of bytes reserved for the local function. Actually, in all the previous examples I showed, the SUB instruction was never there, replaced by a PUSH ECX instruction: that’s because only 4 bytes were required by the functions and the PUSH instruction is definitely faster than a SUB. Anyway, the general form is the previous one.

Well, with the ENTER instruction is possible to compress the three previous instructions into just one. The syntax for ENTER is:

ENTER [NumberOfBytes], [NestedLevel]

The NumberOfBytes operand is a r/m16 and is just like the XXXX operand for the SUB instruction: the number of bytes of dynamic storage allocated onto the stack for the routine. The second operator, instead, is a r/m8 and tells the nesting level of the function inside the original source code: i.e. the number of EBPs pushed before the current one. If this one is 0, then ENTER pushes EBP, then subtracts the first operand from the stack pointer and sets the frame pointer to the current stack-pointer value.

Useless to say that the LEAVE instruction just performs the opposite operation: the epilogue. It just doesn’t need any operator, so the syntax is the simplest one:

LEAVE

The LEAVE instruction is perfectly equivalent to the following sequence of instructions:

mov esp,ebp
pop ebp

Sadly, the ENTER instruction, even if more compact than its equivalent sequence of instructions (6 bytes instead of 9), it’s definitely slower: it takes 15 clock cycles instead of 6. This is why it’s almost never used by compilers, not even in the optimize-size scenarios. For LEAVE things work differently, as it’s sometimes used by compilers to optimize the epilogues.

Strange Case of Switch-Case

When it’s about switch-case constructs in C/C++, that becomes painful in x86 ASM. The first and totally unefficient idea that could come up in mind for the implementation of such construct is to transform it into a series of ifs: maybe this is feasible when there are just a few case statements. Is there a better idea? Well, compilers know.

The most efficient approach is the generation of a pointer table, i.e. a table where the pointers to each destination block are stored. Then, when the execution time comes, the operand inside the switch statement is used as an index for the table itself. Usually this table is written after the function containing the switch statement, but this is not a rigid rule.

I wrote the following simple program in order to study how the switch statement is translated into ASM code:

#include<stdio.h>

int compute(int a)
{
    if (a > 0)
        return a + compute(--a);

    return 0;
}

int main(int argc, char **argv)
{
    if (argc == 2)
    {
        int number = atoi(argv[1]);

        switch (number)
        {
        case 0:
            number = compute(0);
            break;
        case 1:
            number = compute(1);
            break;
        case 2:
            number = compute(2);
            break;
        case 3:
        case 4:
            number = compute(3);
            break;
        default:
            number = compute(4);
        }

        printf("The result is %d\n", number);
    }
    else
    {
        return -1;
    }
}

which was translated into (I will put here just the assembly code of the main function:

_main:
00401020  push        ebp
00401021  mov         ebp,esp
00401023  cmp         dword ptr [ebp+8],2
00401027  jne         0040108F
00401029  mov         eax,dword ptr [ebp+0Ch]
0040102C  push        dword ptr [eax+4]
0040102F  call        004010AE                           ; atoi
00401034  add         esp,4
00401037  cmp         eax,4
0040103A  ja          0040106F
0040103C  jmp         dword ptr [eax*4+00401094h]        ; jump to the right block
00401043  xor         eax,eax                            ; case 0
00401045  jmp         0040107C
00401047  push        0                                  ; case 1
00401049  call        00401000                           ; the compute function
0040104E  add         esp,4
00401051  jmp         0040107C
00401053  push        1                                  ; case 2
00401055  call        00401000
0040105A  add         esp,4
0040105D  inc         eax
0040105E  jmp         0040107C
00401060  push        2                                  ; case 3-4
00401062  call        00401000
00401067  add         esp,4
0040106A  add         eax,2
0040106D  jmp         0040107C
0040106F  push        3                                  ; default
00401071  call        00401000
00401076  add         esp,4
00401079  add         eax,3
0040107C  push        eax                                ; printf
0040107D  push        403000h
00401082  call        dword ptr ds:[0040209Ch]
00401088  add         esp,8
0040108B  xor         eax,eax
0040108D  pop         ebp
0040108E  ret
0040108F  or          eax,0FFFFFFFFh                     ; return -1
00401092  pop         ebp
00401093  ret
00401094  DD          00401043
00401098  DD          00401047
0040109C  DD          00401053
004010A0  DD          00401060
004010A4  DD          00401060

As you can see, there’s some data code at the end of the function: that’s the pointer table. Anyway, I’ll explain here how the whole construct was compiled. First thing, the block starts at 0x00401037, with the CMP instruction: in fact, if the operand is greater than 4, a jump to the default block is performed (the default block is at 0x0040106F.

Then there is the bad-looking JMP instruction at 0x0040103C: this one reads the destination address from the table (stored in the memory’s area 0x00401094-0x004010A7) the destination address and uses the operand as the index of the table. In fact, by reading the contents of the table, it’s easy to realize that it really contains the starting addresses of the various blocks: 0x00401043, 0x00401047, etc.

At the end of each block, there is a JMP to the printf function: that’s just the break C instruction translated as a jump outside of the switch statement.

There are cases, anyway, where this logic isn’t applicable (a big gap between the different indexes?): in those cases, the compiler implements a binary tree search approach to reach the destination. The idea is to split the items into two groups (containing the same number of elements) using their values as criteria. The process is repeated until the single case is individuated. I forced this situation by changing the previous cases to gapped values, such as 0, 1, 20, 300 and 4000, and by leaving the remaining code as it is. What I got is:

00401020  push        ebp
00401021  mov         ebp,esp
00401023  cmp         dword ptr [ebp+8],2
00401027  jne         0040109F
00401029  mov         eax,dword ptr [ebp+0Ch]
0040102C  push        dword ptr [eax+4]
0040102F  call        004010AA
00401034  add         esp,4
00401037  cmp         eax,14h                            ; First switch
0040103A  jg          00401062                           ; is the op > 20?
0040103C  je          00401055                           ; is the op == 20?
0040103E  sub         eax,0                              ; like a cmp
00401041  je          00401051                           ; is the op == 0?
00401043  dec         eax
00401044  jne         00401070                           ; go default
00401046  push        eax                                ; case 1
00401047  call        00401000                           ; the compute function
0040104C  add         esp,4
0040104F  jmp         0040108C                           ; break
00401051  xor         eax,eax                            ; case 0
00401053  jmp         0040108C                           ; break
00401055  push        1                                  ; case 20
00401057  call        00401000                           ; the compute function
0040105C  add         esp,4
0040105F  inc         eax
00401060  jmp         0040108C                           ; break
00401062  cmp         eax,12Ch                           ; is the op == 300?
00401067  je          0040107F
00401069  cmp         eax,0FA0h                          ; is the op == 4000?
0040106E  je          0040107F
00401070  push        3                                  ; default case
00401072  call        00401000                           ; the compute function
00401077  add         esp,4
0040107A  add         eax,3
0040107D  jmp         0040108C                           ; break
0040107F  push        2                                  ; case 300-4000
00401081  call        00401000                           ; the compute function
00401086  add         esp,4
00401089  add         eax,2
0040108C  push        eax
0040108D  push        403000h
00401092  call        dword ptr ds:[0040209Ch]           ; printf
00401098  add         esp,8
0040109B  xor         eax,eax                            ; return 0
0040109D  pop         ebp
0040109E  ret
0040109F  or          eax,0FFFFFFFFh                     ; return -1
004010A2  pop         ebp
004010A3  ret

I tried, again, to comment the code as much as possible in order to not have a complicated explanation after.

Well, I’d say this is enough for now. Actually I think I covered all the basic x86 instructions needed to start understanding some x86 code. I think the next time I will go through the next step of this path: memory overflows and their exploitation.

Again feel free to comment this post if you liked it or, even more important, if you didn’t by pointing out what could be enhanced.

Best regards, guys, and see you the next time.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s