开发者

Finding number of operands in an instruction from opcodes

开发者 https://www.devze.com 2023-03-25 02:57 出处:网络
I am planning on writing my own small disassembler. I want to decode the opcodes which I get upon reading the executable. I see the following opcodes:

I am planning on writing my own small disassembler. I want to decode the opcodes which I get upon reading the executable. I see the following opcodes:

69 62 2f 6c 64 2d 6c

which must correspond to:

imul   $0x6c2d64开发者_如何学Python6c,0x2f(%edx),%esp

Now, the "imul" instruction can have either two or three operands. How do I figure this out from the opcodes I have there?

It's based on Intel's i386 instruction set.


Although the x86 instruction set is quite complex (it's CISC anyway) and I saw many people here are discouraging your attempts in trying to understand it, I'll say the contrary: it still can be understood, and you can learn on the way about why is it so complex and how Intel had managed to extend it several times all the way from 8086 to modern processors.

x86 instructions use variable-length encoding, so they can be made up of multiple bytes. Each byte is there to encode different things, and some of them are optional (it is encoded in the opcode whether those optional fields are used or not).

For example, each opcode can be preceded by zero to four prefix bytes, which are optional. Usually you don't need to worry about them. They are used to change the size of operands, or as escape codes to the "second floor" of the opcode table with extended instructions of modern CPUs (MMX, SSE etc.).

Then there is the actual opcode, which is usually one byte, but can be up to three bytes for extended instructions. If you use only the basic instruction set, you don't need to worry about them too.

Next, there's the so called ModR/M byte (sometimes also called mode-reg-reg/mem), which encodes the addressing mode and operand types. It's used only by opcodes which do have any such operands. It has three bit fields:

  • First two bits (from the left, most significant ones) encode the addressing mode (4 possible bit combinations).
  • Next three bits encode the first register (8 possible bit combinations).
  • The last three bits can encode another register, or extend the addressing mode, depending on what's the setup of the first two bits.

After the ModR/M byte, there could be another optional byte (depending on the addressing mode) called SIB (Scale Index Base). It is used for more exotic addressing modes to encode the scaling factor (1x,2x,4x), base address/register, and index register used. It has the similar layout as the ModR/M byte, but the first two bits from the left (most significant) are used to encode the scale, and the next three and the last three bits encode index and base registers, as the name suggests.

If there's any displacement used, it goes just after that. It may be 0, 1, 2 or 4 bytes long, depending on the addressing mode and execution mode (16-bit/32-bit/64-bit).

The last one is always the immediate data, if any. It can be also 0, 1, 2 or 4 bytes long.

So now, when you know the overall format of x86 instructions, you just need to know what are the encodings for all those bytes. And there are some patterns, contrary to common beliefs.

For example, all register encodings follow a neat pattern ACDB. That is, for 8-bit instructions, the lowest two bits of the register code encode the A, C, D and B registers, correspondingly:

00 = A register (accumulator)
01 = C register (counter)
10 = D register (data)
11 = B register (base)

I suspect that their 8-bit processors used just these four 8-bit registers encoded this way:

       second
      +---+---+
f     | 0 | 1 |          00 = A
i +---+---+---+          01 = C
r | 0 | A : C |          10 = D
s +---+ - + - +          11 = B
t | 1 | D : B |
  +---+---+---+

Then, on 16-bit processors, they doubled this bank of registers and added one more bit in the register encoding to choose the bank, this way:

       second                second         0 00  =  AL
      +----+----+           +----+----+     0 01  =  CL
f     | 0  | 1  |     f     | 0  | 1  |     0 10  =  DL
i +---+----+----+     i +---+----+----+     0 11  =  BL
r | 0 | AL : CL |     r | 0 | AH : CH |
s +---+ - -+ - -+     s +---+ - -+ - -+     1 00  =  AH
t | 1 | DL : BL |     t | 1 | DH : BH |     1 01  =  CH
  +---+---+-----+       +---+----+----+     1 10  =  DH
    0 = BANK L              1 = BANK H      1 11  =  BH

But now you can also choose to use both halves of these registers together, as full 16-bit registers. This is done by the last bit of the opcode (the least significant bit, the right-most one): if it's 0, this is an 8-bit instruction. But if this bit is set (that is, the opcode is an odd number), this is a 16-bit instruction. In this mode, the two bits encode one of the ACDB registers, as before. The patterns stays the same. But they encode full 16-bit registers now. But when the third byte (the highest one) is also set, they switch to a whole another bank of registers, called index/pointer registers, which are: SP (stack pointer), BP (base pointer), SI (source index), DI (destination/data index). So the addressing is now as follows:

       second                second         0 00  =  AX
      +----+----+           +----+----+     0 01  =  CX
f     | 0  | 1  |     f     | 0  | 1  |     0 10  =  DX
i +---+----+----+     i +---+----+----+     0 11  =  BX
r | 0 | AX : CX |     r | 0 | SP : BP |
s +---+ - -+ - -+     s +---+ - -+ - -+     1 00  =  SP
t | 1 | DX : BX |     t | 1 | SI : DI |     1 01  =  BP
  +---+----+----+       +---+----+----+     1 10  =  SI
    0 = BANK OF           1 = BANK OF       1 11  =  DI
  GENERAL-PURPOSE        POINTER/INDEX
     REGISTERS             REGISTERS

When introducing 32-bit CPUs, they doubled these banks again. But the pattern stays the same. Just now the odd opcodes mean the 32-bit registers and the even opcodes, as before, 8-bit registers. I'd call the odd opcodes the "long" versions, because the 16/32-bit version is used depending on the CPU and its current mode of operation. When it operates in 16-bit mode, the odd ("long") opcodes mean 16-bit registers, but when it operates in 32-bit mode, the odd ("long") opcodes mean 32-bit registers. It can be flipped around by prefixing the whole instruction with the 66 prefix (operand size override). The even opcodes (the "short" ones) are always 8-bit. So in 32-bit CPU, the register codes are:

0 00 = EAX      1 00 = ESP
0 01 = ECX      1 01 = EBP
0 10 = EDX      1 10 = ESI
0 11 = EBX      1 11 = EDI

As you can see, the ACDB pattern stays the same. Also the SP,BP,SI,SI pattern stays the same. It just uses the longer versions of the registers.

There are also some patterns in the opcodes. One of them I've described already (the even vs. odd = 8-bit "short" vs. 16/32-bit "long" stuff). More of them you can see in this opcode map I've made once for quick referencing and hand-assembling/disassembling stuff:

Finding number of operands in an instruction from opcodes

(It's not a full table yet, some of the opcodes are missing. Maybe I'll update it someday.)

As you can see, arithmetic & logic instructions are mostly located in the upper half of the table, and the left & right halves of it follow a similar layout. Data moving instructions are at the lower half. All branching instructions (conditional jumps) are in row 7*. There's also one full row B* reserved for mov instruction, which is a shorthand for loading immediate values (constants) into registers. They're all one-byte opcodes immediately followed by the immediate constant, because they encode the destination register in the opcode (they're chosen by the column number in this table), in its three least significant bytes (right-most ones). They follow the same pattern for register encoding. And the fourth bit is the "short"/"long" choosing one. You can see that your imul instruction is alreay in the table, exactly at the 69 position (huh.. ;J).

For many instructions, the bit just before the "short/long" bit, is to encode the order of operands: which one of the two registers encoded in the ModR/M byte is the source, and which one is the destination (this applies to the instructions with two register operands).

As to the ModR/M byte's addressing mode field, here's how to interpret it:

  • 11 is the simplest: it encodes register-to-register transfers. One register is encoded by the three next bits (the reg field), and the other register by the other three bits (the R/M field) of this byte.
  • 01 means that after this byte, a one-byte displacement will be present.
  • 10 means the same, but the displacement used is four-byte (on 32-bit CPUs).
  • 00 is the trickiest: it means indirect addressing or a simple displacement, depending on the contents of the R/M field.

If the SIB byte is present, it is signaled by the 100 bit pattern in the R/M bits. There's also a code 101 for 32-bit displacement-only mode, which doesn't use the SIB byte at all.

Here's a summary of all these addressing modes:

Mod R/M
 11 rrr = register-register  (one encoded in `R/M` bits, the other one in `reg` bits).
 00 rrr = [ register ]       (except SP and BP, which are encoded in `SIB` byte)
 00 100 = SIB byte present
 00 101 = 32-bit displacement only (no `SIB` byte required)
 01 rrr = [ rrr + disp8 ]    (8-bit displacement after the `ModR/M` byte)
 01 100 = SIB + disp8
 10 rrr = [ rrr + disp32 ]   (except SP, which means that the `SIB` byte is used)
 10 100 = SIB + disp32

So let's now decode your imul:

69 is its opcode. It encodes the imul's version which doesn't sign-extend the 8-bit operands. The 6B version does sign-extend them. (They differ by the bit 1 in the opcode if anyone asked.)

62 is the RegR/M byte. In binary it is 0110 0010 or 01 100 010. First two bytes (the Mod field) mean the indirect addressing mode, and that the displacement will be 8-bit. The next three bits (the reg field) are 100 and encode the SP register (in this case ESP, since we're in 32-bit mode) as the destination register. The last three bits are the R/M field and we have 010 there, which encode the D register (in this case EDX) as the other (source) register used.

Now we expect an 8-bit displacement. And there it is: 2f is the displacement, a positive one (+47 in decimal).

The last part is four bytes of the immediate constant, which is required by the imul instruction. In your case this is 6c 64 2d 6c which in little-endian is $6c2d646c.

And that's the way the cookie crumbles ;-J


The manuals do describe how to differentiate between one, two, or three operand versions.

Finding number of operands in an instruction from opcodes

F6/F7: one operand; 0F AF: two operands; 6B/69: three operands.


Some advice, first get all the instruction set docs you can get your hands on. for this x86 case try for some old 8088/86 manuals as well as more recent, from intel as well as the wealth of opcode tables on the net. various interpretation and documentation might first have subtle documentation errors or differences, and second some folks may present the info in a different and more understandable way.

Second, if this is your first disassembler I recommend avoiding x86, it is very hard. As your question implies variable word length instruction sets are difficult, to make a remotely successful disassembler, you need to follow the code in execution order, not memory order. So your disassembler has to use some sort of scheme to not only decode and print instructions but decode jump instructions and tag destination addresses as entry points into an instruction. for example ARM, is fixed instruction length, you can write an ARM disassembler that starts at the beginning of ram and disassembles each word straight through (assuming of course it is not a mixture of arm and thumb code). thumb (not thumb2) can be disassembled this way as there is only one flavor of 32 bit instruction, everything else is 16 bit, and that one flavor can be handled in a simple state machine as those two 16 bit instructions show up as pairs.

You are not going to be able to disassemble everything (with a variable length instruction set) and due to nuances of some hand coding or intentional tactics to prevent disassembly your up front code that walks the code in execution order may have what I would call a collision, for example your instructions above. Say that one path takes you to 0x69 being the entry point in to the instruction and you determine from that that is a 7 byte instruction, but say somewhere else there is a branch instruction whose destination computes as 0x2f being the opcode for an instruction, although very clever programming may pull something like that off, it is more likely that the disassembler has been lead to disassemble data. for example

clear condition flag
branch if condition flag clear
data

The disassembler wont know the data is data, and without additional smarts the disassembler wont realize that the conditional branch is in fact an unconditional branch (there could be many instructions on different branch paths between the condition clear and branch if condition clear) so it assumes the byte after the conditional branch is an instruction.

lastly I applaud your efforts, I often preach writing simple disassemblers (ones that assume the code is very short, intentionally crafted code) to learn an instruction set very well. If you dont put the disassembler into a situation where it has to follow in execution order and instead it can go in memory order (basically do not embed data between instructions, put it at the end or somewhere else leaving only strings of instructions to be disassembled). understanding the opcode decoding for an instruction set can make you much better at programming for that platform both for low level and high level languages.

short answer, intel used to publish, and maybe still does, technical reference manuals for the processors, I still have my 8088/86 manuals, a hardware one for the electrical stuff, and a software one for the instruction set and how it works. I have a 486 and probably a 386 one. The snapshot in Igor's answer directly resembles an intel manual. Because the instruction set has evolved so much over time makes x86 a difficult beast at best. At the same time, if the processor itself can wade through these bytes and execute them, you can write a program that can do the same thing but decode them. the difference being you are likely not going to make a simulator and any branches that are computed by the code and not explicit in the code you will not be able to see and the destination for that branch may not show up in your list of bytes to disassemble.


That is not a machine code instruction (which would consist of an opcode and zero or more operands).

That is part of a text string, it translates as:

$ echo -e "\x69\x62\x2f\x6c\x64\x2d\x6c"
ib/ld-l

which obviously is part of the string "/lib/ld-linux.so.2".


If you don't feeling like shifting through opcode tables/manuals, it always helps to learn from other's projects, like the open source disassembler, bea-engine, you might find that you don't even need to create your own one anymore, depending on what your doing it for.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号