Coordinate system for Hex-Rays

One of the must-have features of a reverse engineering tool is the ability to add comments to the output listing. Without it, the output stays difficult to understand. Users end up copying the listing to a text editor to continue the analysis, but this is a bad solution because the dynamic nature of the output is lost. The command to rename variables alleviates the problem, but comments are still necessary.

At first sight, implementing this feature is a piece of cake: just remember the line number and the comment string, and you are done. For example:

The first 3 lines are commented. The comment information could be stored in a table similar to this:

Line #    Comment
1         cmt#1
2         cmt#2
3         cmt#3

Unfortunately, this approach works only with static text, not with highly dynamic text such as decompiler output. The line numbers can change at any time. Even an innocent action such as renaming a variable can reformat the output:

I renamed v6 to my_favorite_ptr_with_fancy_name and the decompiler broke the overly long line into two separate lines.

Since the line numbers keep changing, we cannot use them.
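
To make the failure concrete, here is a minimal sketch of the line-number table in Python. It is only an illustration, not the decompiler's storage format: the listing lines and the annotate helper are made up for the example and are not the listing from the screenshots.

```python
# Comments keyed by line number, as in the table above.
comments = {1: "cmt#1", 2: "cmt#2", 3: "cmt#3"}

# Invented three-line listing before the rename.
before = [
    "v6 = *(_DWORD *)(a1 + 4);",
    "if ( v6 )",
    "  sub_call(v6);",
]

# Same code after the rename: the first statement is wrapped onto two lines.
after = [
    "my_favorite_ptr_with_fancy_name =",
    "    *(_DWORD *)(a1 + 4);",
    "if ( my_favorite_ptr_with_fancy_name )",
    "  sub_call(my_favorite_ptr_with_fancy_name);",
]

def annotate(lines, comments):
    """Pair each commented line number with the text currently on that line."""
    return {n: (lines[n - 1], c) for n, c in comments.items() if n <= len(lines)}

old = annotate(before, comments)
new = annotate(after, comments)

# cmt#2 used to sit on the `if` line; after the rename it sits on the
# continuation of the assignment, and the call on line 4 has no comment at all.
print(old[2])
print(new[2])
```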

We cannot attach the comment information to the intermediate representation (IR) either. The reason is that the IR can change as well. To illustrate my point, I’ll take this function:

Before showing how IR can change, I’ll improve the text (just to show you that the output is highly dynamic). I define the following structures:

Then I change the function prototype to use the defined structures:

This is much better! All casts are gone, the output is clean.

Now imagine that the returned value is never used by the program. I change the function prototype to reflect this fact. The output code changes drastically:

Please note that the else clause has completely disappeared. All assignments to the result variable have disappeared too. In other words, the IR is different.

Reverse transformations, where the output becomes longer, are possible as well. Usually they happen when the user changes the return type from void to int, or from int to long long.

Also, future versions of the decompiler could introduce commands to transform the output text. This code:

if ( cond1 )
  if ( cond2 )
    actions;

sometimes looks better if represented as

if ( cond1 && cond2 )
  actions;

If the user can switch between equivalent forms of the output text (and this will become possible in the future), we cannot use the IR to attach comments.
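
The effect of such a rewrite on IR-attached comments can be sketched with a toy tree. Everything here is hypothetical: the If class, merge_nested_ifs and the idea of keying comments by node identity are assumptions made for the illustration, not the decompiler's real IR.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class If:
    cond: str
    body: List["Node"]

Node = Union[If, str]  # a statement is either an If node or plain statement text

def merge_nested_ifs(node):
    """Rewrite `if (a) if (b) X` as `if (a && b) X`, recursively.

    Conditions are assumed to be simple names; a real rewrite would have to
    parenthesize operands that contain lower-precedence operators.
    """
    if not isinstance(node, If):
        return node
    body = [merge_nested_ifs(child) for child in node.body]
    if len(body) == 1 and isinstance(body[0], If):
        inner = body[0]
        return If(f"{node.cond} && {inner.cond}", inner.body)
    return If(node.cond, body)

def walk(node):
    """Yield every node of the tree, parents before children."""
    yield node
    if isinstance(node, If):
        for child in node.body:
            yield from walk(child)

inner = If("cond2", ["actions;"])
outer = If("cond1", [inner])
comments = {id(inner): "explains cond2"}  # comment keyed by IR node identity

merged = merge_nested_ifs(outer)
live = {id(n) for n in walk(merged)}
lost = [text for key, text in comments.items() if key not in live]
print(merged.cond, lost)
```

The inner node no longer exists in the merged tree, so a comment keyed by it has nothing left to attach to.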


My solution to the problem is simple. I decided to attach comments to the instruction addresses. It is possible to trace back every line of the decompiler output to the assembly instruction that generated it. Our sample function has been generated from this assembly:

Each line can be mapped to an address:

Unfortunately, this mapping is not bijective: several output lines can share one address. If we attach comments to addresses, we still need more information to locate the exact line. For example, the address 10002DEB has 3 lines corresponding to it.
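
A small sketch shows how such collisions can be detected. The address 10002DEB comes from the example above, but the other addresses and all line texts are invented placeholders, and lines_by_address is a hypothetical helper.

```python
from collections import defaultdict

# (address, text) pairs for an output listing; the texts are placeholders.
listing = [
    (0x10002DE0, "v3 = a1->count;"),
    (0x10002DEB, "if ( v3 )"),
    (0x10002DEB, "{"),
    (0x10002DF4, "process(a1);"),
    (0x10002DEB, "}"),
]

def lines_by_address(listing):
    """Group output lines by the address of the instruction that produced them."""
    groups = defaultdict(list)
    for ea, text in listing:
        groups[ea].append(text)
    return dict(groups)

groups = lines_by_address(listing)

# Addresses that map to more than one line cannot identify a line on their own.
ambiguous = {ea: texts for ea, texts in groups.items() if len(texts) > 1}
for ea, texts in ambiguous.items():
    print(f"{ea:08X}: {len(texts)} lines")
```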

To distinguish between several lines with the same address, I used the very last item of the output line. Possible values are the punctuation characters ( ) { } , ; : and a few keywords that can appear on a separate line, such as do and else. As it turns out, this additional markup is enough to distinguish lines in almost all cases. Unfortunately, there are still some unhandled situations (for example, rep movsb is represented with a multi-line loop), but overall the solution is good. The positive points are:

  • It does not depend on the intermediate representation.
  • It does not depend on the line numbers.
  • It is resistant to code transformations that swap statements (imagine negating an if condition and swapping its then and else branches) or even replace them (imagine replacing while with for).
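
The keying scheme described above can be sketched as follows. This is only an approximation under stated assumptions: the function names, the keyword list (just do and else, the two named above) and the sample listings are hypothetical, and the real decompiler may classify the last item differently.

```python
PUNCT = "(){},;:"          # punctuation that can end an output line
KEYWORDS = ("do", "else")  # keywords that can stand alone on a line (assumed list)

def last_item(text):
    """Return the trailing punctuation mark or keyword of a line, or None."""
    stripped = text.rstrip()
    if not stripped:
        return None
    if stripped[-1] in PUNCT:
        return stripped[-1]
    last_word = stripped.split()[-1]
    return last_word if last_word in KEYWORDS else None

def comment_key(ea, text):
    """Key a line by its instruction address plus its last item."""
    return (ea, last_item(text))

def render(listing, comments):
    """Append the stored comment, if any, to each (address, text) line."""
    out = []
    for ea, text in listing:
        cmt = comments.get(comment_key(ea, text))
        out.append(f"{text} // {cmt}" if cmt else text)
    return out

comments = {(0x10002DEB, ")"): "main check"}

before = [
    (0x10002DEB, "if ( v3 )"),
    (0x10002DEB, "{"),
    (0x10002DF4, "process(a1);"),
    (0x10002DEB, "}"),
]

# After a rename, a wrapped assignment appears first and the `if` moves down.
after = [
    (0x10002DE0, "counter_with_long_name ="),
    (0x10002DE0, "    a1->count;"),
    (0x10002DEB, "if ( counter_with_long_name )"),
    (0x10002DEB, "{"),
    (0x10002DF4, "process(a1);"),
    (0x10002DEB, "}"),
]

print(render(before, comments)[0])
print(render(after, comments)[2])
```

The comment follows the if line in both listings even though its line number and its text changed, because neither is part of the key.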

However, some code transformations can change the output too much and some comments could become invisible. Currently this is what happens but I plan to introduce an option to display orphaned comments somewhere in the output text. Since everyone hates losing an entered comment, I’ll make sure that they are not lost even if the output text is totally unrecognizable after your modifications 😉
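
One possible way to keep orphaned comments visible, sketched under the assumption that comments are stored under (address, last item) keys, is to compare the stored keys with the keys present in the current output and list the leftovers at the end. The helper names and the address 10002E10 are invented for this illustration; this is not a description of the planned option.

```python
def split_comments(stored, visible_keys):
    """Separate comments that still match an output line from orphaned ones."""
    placed = {k: c for k, c in stored.items() if k in visible_keys}
    orphaned = {k: c for k, c in stored.items() if k not in visible_keys}
    return placed, orphaned

def orphan_block(orphaned):
    """Format orphaned comments as trailing comment lines, ordered by address."""
    items = sorted(orphaned.items(), key=lambda kv: kv[0][0])
    return [f"// orphaned comment at {ea:08X}: {text}" for (ea, _item), text in items]

stored = {
    (0x10002DEB, ")"): "main check",
    (0x10002E10, "else"): "fallback path",  # its else clause has been optimized away
}

# Keys of the lines that exist in the current output.
visible_keys = {(0x10002DEB, ")"), (0x10002DEB, "{"), (0x10002DEB, "}")}

placed, orphaned = split_comments(stored, visible_keys)
print(orphan_block(orphaned))
```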