If you have used IDA Pro for a while, you might have noticed that it contents itself
with rather simple things. It neatly displays the disassembly listing.
It allows you to improve the listing by adding names and comments. You can manually define
your own symbols, types, and functions. IDA itself can add some types and discover some
program properties, but overall the analyses it performs appear to be rather modest.
On the other hand, things seem to be simple.
After spending some time contemplating these always-the-same code patterns
one gets used to them. Many binary analysis problems start to look elementary, if not boring.
One could even think: it is so obvious, why not automate it? Let the computer
find things for us. Questions like "where will this register be used?"
or "where does that value come from?" could be answered by the computer!
Sure, there will be situations when it fails, but at least in the other cases the
answers could be calculated without any human intervention…
When we implement the first naive approach, it turns out to work with our test program.
Then we switch to another program and get our first surprise: it does not work at all!
Switching to another file, compiled by another compiler or even by the same compiler
with a different set of options, spoils everything.
The patterns that worked for the first compiler no longer match, and the analysis fails.
The next step could be to add patterns for the second compiler. Then for the third.
If we take this approach, our analysis tool will never be complete.
Imagine we want to find where a function expects its input arguments. This question arises
very often during binary analysis and answering it correctly is very important.
Without knowing where to look for the input we can never find out what the input values are.
For example, this function
    push    ebp
    mov     ebp, esp
    movsx   eax, [ebp+8]
    ...
apparently expects one of its arguments on the stack because it copies
a value located above the return address into the eax register.
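A first mechanical check might simply look for positive ebp displacements: with the classic frame setup, [ebp] holds the saved ebp, [ebp+4] the return address, and anything at [ebp+8] or above is a stack argument. Here is a minimal sketch of such a check in Python; the operand is assumed to be already available as a plain text string, which is a simplification for illustration, not IDA's actual API:

    import re

    # Operands such as [ebp+8] or [ebp+0Ch] address memory above the return address,
    # i.e. stack arguments (assuming the classic push ebp / mov ebp, esp prologue).
    EBP_DISP = re.compile(r"\[ebp\+([0-9A-Fa-f]+)(h?)\]")

    def stack_argument_offset(operand):
        """Return the displacement if the operand looks like a stack argument, else None."""
        m = EBP_DISP.fullmatch(operand)
        if not m:
            return None
        digits, hex_suffix = m.groups()
        disp = int(digits, 16 if hex_suffix else 10)
        # [ebp+0] is the saved ebp, [ebp+4] is the return address
        return disp if disp >= 8 else None

    print(stack_argument_offset("[ebp+8]"))    # 8: the first stack argument
    print(stack_argument_offset("[ebp-4]"))    # None: a local variable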
Another example. This function
    push    ebp
    mov     ebp, esp
    call    [ecx+10h]
    ...
uses ecx without initializing it. Thus ecx must already hold a meaningful value
when the function is entered, otherwise the application would crash. So the rule might be:
INPUT ARGUMENTS RULE
If the function uses a register or stack location without initializing it, then it is an
input argument of the function.
We have not defined what the word 'uses' means.
Let's say that a direct read access constitutes a use. Usually arguments are
accessed directly, so the requirement of a direct access does not limit our
analysis1.
We ‘just’ need to perform the data flow analysis and find all these registers.
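Something along these lines might be a first attempt at implementing the rule (a minimal sketch in Python; the (mnemonic, registers-read, registers-written) triples are a hypothetical pre-decoded form, not IDA's API):

    def guess_register_arguments(instructions):
        """Naive rule: a register read before it is ever written looks like an input argument."""
        written = set()
        arguments = set()
        for mnem, reads, writes in instructions:
            for reg in reads:
                if reg not in written:
                    arguments.add(reg)       # used without prior initialization
            written.update(writes)
        return arguments

    # The thiscall example from above: 'call [ecx+10h]' reads ecx before anything writes it.
    demo = [
        ("push", {"ebp", "esp"}, {"esp"}),
        ("mov",  {"esp"},        {"ebp"}),
        ("call", {"ecx"},        {"eax"}),   # simplified: a call clobbers more than eax
    ]
    # Prints {'ebp', 'esp', 'ecx'}; the frame registers alone show the rule needs refinement.
    print(guess_register_arguments(demo))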
Unfortunately, this does not work at all. Here is a counterexample
    push    ebp
    mov     ebp, esp
    push    esi
    ...
    pop     esi
    pop     ebp
    ret
This push instruction reads the esi register.
But we all know that a push/pop
pattern like this represents the 'save-restore' idiom. The register is not really used.
We do not care about its value; we just save it on the stack and restore it at the exit.
For any experienced IDA user all this is pretty obvious.
We have to amend our rule about the function
inputs. It could look like this:
INPUT ARGUMENTS RULE VERSION 2
If the function really uses (and does not merely save and restore) a register or stack
location without initializing it, then it is an input argument of the function.
My modifications to the rule are very vague. I left them that way for a reason: the problem of finding
preserved registers is not simple at all!
We will build an algorithm to find preserved registers.
In our first attempt we look for push/pop pairs:
PRESERVED REGISTERS RULE VERSION 1
A register is preserved by the function if there is a pop instruction that restores the
value saved by the matching push instruction at the function entry.
This rule works perfectly well with this function:
    push    edi
    push    esi
    ...
    pop     esi
    pop     edi
    ret
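A literal implementation of this first version might look like the following sketch (again Python over a flat list of (mnemonic, operand) pairs; a toy model, not production code):

    def preserved_registers_v1(instructions):
        """Rule version 1: pair the pushes at the entry with the pops right before the ret."""
        pushed = []
        i = 0
        while i < len(instructions) and instructions[i][0] == "push":
            pushed.append(instructions[i][1])       # prologue pushes, in order
            i += 1
        popped = []
        j = len(instructions) - 1
        if j >= 0 and instructions[j][0] == "ret":  # look only at the final return
            j -= 1
            while j >= 0 and instructions[j][0] == "pop":
                popped.append(instructions[j][1])   # epilogue pops, in reverse order
                j -= 1
        # a register counts as preserved if it appears in both lists (ordering is ignored)
        return {reg for reg in pushed if reg in popped}

    demo = [("push", "edi"), ("push", "esi"), ("...", None),
            ("pop", "esi"), ("pop", "edi"), ("ret", None)]
    print(preserved_registers_v1(demo))             # {'edi', 'esi'} (in some order)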
Alas, a counterexample is obvious:
    push    edi
    push    esi
    ...
  ret1:
    pop     esi
    pop     edi
    ret
    ...
  ret2:
    pop     esi
    pop     edi
    ret
Strictly speaking we do not have pairs here. There are several pops for each push.
The second version of the rule:
PRESERVED REGISTERS RULE VERSION 2
A register is preserved by the function if, for each return instruction, there is a pop
instruction that restores the value saved by a push instruction at the function entry.
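Version 2 could be sketched like this (same toy model as before; note that it still walks the listing in linear order, while real code would have to follow the control flow):

    def preserved_registers_v2(instructions):
        """Rule version 2: every return must be preceded by pops matching the entry pushes."""
        pushed = []
        i = 0
        while i < len(instructions) and instructions[i][0] == "push":
            pushed.append(instructions[i][1])
            i += 1
        preserved = set(pushed)
        for k, (mnem, _) in enumerate(instructions):
            if mnem != "ret":
                continue
            popped = set()
            j = k - 1
            while j >= 0 and instructions[j][0] == "pop":
                popped.add(instructions[j][1])      # run of pops just before this return
                j -= 1
            preserved &= popped                     # keep registers restored on every return
        return preserved

This version handles the counterexample with two returns above.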
I'm sure you realize that this rule is too naive… Let's check this code (generated
by GNU C++):
    push    ebp
    mov     ebp, esp
    sub     esp, 18h
    mov     [ebp-8], ebx
    mov     [ebp-4], esi
    ...
    mov     ebx, [ebp-8]
    mov     esi, [ebp-4]
    mov     esp, ebp
    pop     ebp
    retn
So the push instruction is not the only means of preserving registers.
This example effectively destroys our rule beyond any hope. It was built around the
idea that a function would push registers onto the stack, do its job, then pop them from
the stack. Once the push/pop instructions are replaced by movs,
we have to abandon this approach. Here are two more examples.
The first one:
    push    ebp
    mov     ebp, esp
    push    esi
    mov     [ebp-4], edi
    ...
    mov     edi, [ebp-4]
    mov     esp, ebp
    pop     ebp
    retn
Please note that esi is pushed but never restored, i.e. it is not preserved by this
function2!
Here is the second one:
    push    ebp
    mov     ebp, esp
    sub     esp, 20h
    ...
    push    esi
    ...
    pop     esi
    ...
    mov     esp, ebp
    pop     ebp
    retn
As you see, a register is saved in the middle of the function.
To summarize:
I used microcode to overcome the instruction variety. All x86 instructions,
be they push, mov, or pusha, are converted to microcode data transfer instructions.
Data flow analysis helps to find these instructions anywhere in the function body.
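As a rough illustration of the idea (a toy Python model, not the actual Hex-Rays microcode), every way of saving a register can be lowered to the same abstract 'move' operation, so the later analysis has to deal with only one kind of instruction:

    def lower(insn):
        """Lower a (mnemonic, operand-list) pair into abstract move operations."""
        mnem, ops = insn
        if mnem == "push":                  # push reg  ->  move reg into a stack slot
            return [("move", "stack_slot", ops[0])]
        if mnem == "pop":                   # pop reg   ->  move a stack slot into reg
            return [("move", ops[0], "stack_slot")]
        if mnem == "mov":                   # mov dst, src
            return [("move", ops[0], ops[1])]
        if mnem == "pusha":                 # pusha     ->  one move per general register
            regs = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
            return [("move", "stack_slot", r) for r in regs]
        return [(mnem, *ops)]               # everything else passes through unchanged

    print(lower(("push", ["esi"])))         # [('move', 'stack_slot', 'esi')]
    print(lower(("mov", ["[ebp-8]", "ebx"])))

A real lowering would also have to track the exact stack offset of each slot and the update of esp; the sketch ignores both.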
Here is the outline of the data flow analysis3: convert the instructions to microcode,
build the control flow graph, and iteratively propagate register definitions and uses
over it until a fixed point is reached.
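A minimal sketch of that iterative scheme, here computing live registers over a toy control flow graph; this illustrates the general approach, not the exact analysis behind the rule:

    def live_registers(blocks, succ):
        """blocks: {name: (use_set, def_set)}, succ: {name: [successor names]}."""
        live_in = {b: set() for b in blocks}
        live_out = {b: set() for b in blocks}
        changed = True
        while changed:                       # iterate until a fixed point is reached
            changed = False
            for b, (use, defs) in blocks.items():
                out = set()
                for s in succ.get(b, []):
                    out |= live_in[s]        # live after b = live before its successors
                new_in = use | (out - defs)  # live before b = used here, or live after
                if new_in != live_in[b] or out != live_out[b]:
                    live_in[b], live_out[b] = new_in, out
                    changed = True
        return live_in, live_out

    blocks = {"entry": ({"ecx"}, {"eax"}), "exit": ({"eax"}, set())}
    succ = {"entry": ["exit"]}
    print(live_registers(blocks, succ)[0]["entry"])   # {'ecx'}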
There were many minor problems, but they are too boring to delve into.
To make a long story short, I'll just show the final version of the rule:
PRESERVED REGISTERS – FINAL VERSION
A register is preserved by the function iff its value at the function entry is stored, by
any data transfer instruction, to a stack location or to another register, and that stored
value is moved back into the register on every path leading to a return from the function.
This rule is still not perfect (indirect accesses are not taken into account), but overall it
works well.
Please note that we considered only compiler-generated code. Hand-crafted code is
almost always impervious to automated analysis. Fortunately, there is less and less
hand-written assembler code around.
I hope this small example illustrates my point: binary analysis problems are much more difficult
than they look. Simple approaches simply do not work.
Even if we strictly limit ourselves to compiler-generated code, they still fail.
They will work only if we limit ourselves to a specific version of a specific
compiler, or even to a specific binary.
So crafting a naive script to solve
today's problem for today's binary might be the right thing to do. For anything more general,
expect monstrous rules like the one above to haunt your logic…
P.S. The gcc compiler source code abounds with lengthy and complex rules; does that
mean we have to live with them? 😉
1.
There are many exceptions to this rule; here is just one of them:
the trailing arguments of variadic functions are always accessed indirectly; but since
all these arguments are anonymous, we do not lose anything. In fact there are many minor
details which are irrelevant here.
2.
If you wonder which compiler generates such code, it is ICC, Intel's compiler.
3.
The iterative approach to data flow analysis is explained in this book:
Muchnick, Steven S. Advanced Compiler Design and Implementation