Sometimes unexpected detours are necessary to reach the goal. Take this simple
assembly code:
This compiler generated code calculates the length of the input string. If you do not remember
the exact definition of repne scasb, here is another snippet which does the same thing:
A straightforward decompilation of the first snippet yields this:
I can’t say that the C code is any better than the assembly code:
- Single repne scasb has been replaced by an obscure loop.
- An additional variable to represent the ZF flag has been introduced.
- The result is longer than the initial assembly code.
It would be nice if the decompiler could replace this assembly code
by a call to strlen. For a human reader, the difference would be spectacular:
Just one meaningful line, no puzzling x86 instructions,
just plain and understandable code!
Now, the question is, how do I transform the initial assembly code into this ideal
decompilation result? I could hardcode the decompiler to check if the first instruction
is mov, the second is xor, and so on. You know better than me that this naive approach
is severely limited: as soon as the compiler decides to shuffle instructions, use different
registers, or replace repne scasb with a loop, our decompiler would be hopelessly
confused and lost. Also, different compilers generate different code for built-in functions
(just remember the second strlen example).
I can not hope to hardcode all these variations by hand! What if I could specify
the sequence in an abstract form and match it against real assembly code?
This idea looked attractive for me: I just need to build the pattern matcher once
and specify patterns for built-in functions. Patterns could look like this:
- x86 instructions are gone – they have been replaced by abstract instructions for a virtual machine.
- Registers are gone – they have been replaced by abstract variable names.
Difficulties are not where we expect them – the most laborious part of the task turned out to be
the pattern reader utility which would read the above text representation and produce
something binary. And here I stopped and asked myself: what binary representation do I need?
The answer was surprising: the pattern reader would generate a C text! The main reason
is that C text is most portable, you just need to compile it. I could generate a binary
file but then I would need to design its format. I could generate another text file but then
I would need another reader. C code has a reader – a C compiler, it can also have any
format I want with the structure and union declarations.
The path to the result turned out to be not as straight as I hoped:
The decompiler would be based on a utility which generates C code
from an assembler for a virtual machine. Everything got mixed up.