This is a guest entry written by Joxean Koret from Activision. His views and opinions are his own and not those of Hex-Rays. Any technical or maintenance issues regarding the code herein should be directed to the author.
Diaphora is an Open Source IDA plugin for doing binary diffing (usually called bindiffing, for short). In a nutshell, binary diffing is a reverse engineering technique used to find either the similarities or the differences between various pieces of software, in binary form. The technique was most likely invented by Thomas Dullien (Halvar Flake), author of the very first publicly available bindiffing tool called BinDiff.
I published this Open Source project, Diaphora, in 2015 and I have been testing and updating it for every single minor IDA version since then, which means from version 6.6 to the current 8.2 (as of the time of writing this blog post).
In this blog post I will discuss, briefly, how Diaphora works and, more in depth, show example usages. Let’s start…
This is how bindiffing works, in general: Two or more binaries are analysed and features about each function found in the binary are extracted. Then, the extracted functions are matched, using a set of heuristics, and compared, using some comparison function, to determine how close or different they are.
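The three steps just described (extract features, match with heuristics, compare) can be sketched in a few lines of Python. This is only an illustration and not Diaphora's actual code: the feature set (a list of mnemonics plus a hash of it), the two heuristics and all function names are assumptions made for this example.

```python
import hashlib
from difflib import SequenceMatcher


def extract_features(name, instructions):
    """Build a tiny, hypothetical feature record for one function."""
    mnemonics = [ins.split()[0] for ins in instructions]
    digest = hashlib.sha256(" ".join(mnemonics).encode()).hexdigest()
    return {"name": name, "mnemonics": mnemonics, "mnemonics_hash": digest}


def similarity(f1, f2):
    """Comparison function: ratio between 0.0 and 1.0 over the mnemonic lists."""
    return SequenceMatcher(None, f1["mnemonics"], f2["mnemonics"]).ratio()


def match_functions(primary, secondary):
    """Heuristic 1: identical mnemonics hash. Heuristic 2: highest ratio left."""
    matches = []
    remaining = list(secondary)
    for f1 in primary:
        exact = [f2 for f2 in remaining
                 if f2["mnemonics_hash"] == f1["mnemonics_hash"]]
        if exact:
            best = exact[0]
        elif remaining:
            best = max(remaining, key=lambda f2: similarity(f1, f2))
        else:
            break  # nothing left to match against
        matches.append((f1["name"], best["name"], round(similarity(f1, best), 3)))
        remaining.remove(best)
    return matches
```

The matching here is greedy, so an early function can take a candidate that a later one would have matched better. A real tool runs many heuristics in order of reliability to reduce this problem.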
This is a brief list of some example features that can be extracted from functions:
Some example heuristics used by Diaphora to compare 2 functions can be the following:
When we execute the diaphora.py script it shows the following dialogue:
Here we can select the SQLite database to which the current IDA database will be exported, a secondary, previously exported SQLite database to diff against, and the memory address range that limits what is exported. We can also enable or disable many different options, such as whether to use the decompiler, what to export, which heuristics to use, what to exclude, etc.
Diaphora, as pretty much any other binary diffing tool, works this way:
In short, we have to export our binaries, then diff the binaries and, optionally, import everything from one database to the other.
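As a rough sketch of this export, diff and import cycle, the following Python code stores functions in SQLite files and joins two exports on a hash. The table layout (`functions` with `name`, `address` and `bytes_hash`) and the helper names are hypothetical, chosen only for this example; they are not Diaphora's real database schema or API.

```python
import sqlite3
from contextlib import closing


def export_functions(db_path, functions):
    """Export step: store (name, address, bytes_hash) rows in a SQLite file."""
    with closing(sqlite3.connect(db_path)) as db:
        with db:  # commit on success
            db.execute(
                "CREATE TABLE IF NOT EXISTS functions "
                "(name TEXT, address INTEGER, bytes_hash TEXT)"
            )
            db.executemany("INSERT INTO functions VALUES (?, ?, ?)", functions)


def diff_databases(current_path, other_path):
    """Diff step: pair functions whose bytes_hash is identical in both exports."""
    with closing(sqlite3.connect(current_path)) as db:
        db.execute("ATTACH DATABASE ? AS other", (other_path,))
        return db.execute(
            "SELECT c.name, o.name FROM main.functions c "
            "JOIN other.functions o ON c.bytes_hash = o.bytes_hash "
            "ORDER BY c.address"
        ).fetchall()


def import_names(matches, rename):
    """Import step: apply the name from the other export to each matched function."""
    for current_name, other_name in matches:
        rename(current_name, other_name)
```

The `rename` argument is any callable taking the current and the new name, so the same sketch can be tried outside IDA with a plain dictionary.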
The most common binary diffing use cases (i.e., why we reverse engineers use bindiffing tools in our day-to-day job) are the following ones:
Let’s see examples with Diaphora of some of the previously mentioned potential use-cases:
sqldiff from some Ubuntu version) and then import the matches.
According to Mitre, CVE-2020-0674 was a remote code execution vulnerability in the way that the scripting engine handled objects in memory in Internet Explorer, aka ‘Scripting Engine Memory Corruption Vulnerability’. We will work with 2 jscript.dll binaries with the following SHA256 hashes:
Now, let’s load the first binary in IDA (408cb1604d003f38715833a48485b6a4e620edf163fb59aef792595866e4796b), let the auto-analysis finish and then run diaphora.py from within IDA, leave all options by default and click "OK"; Diaphora will start to export all functions, structs, enums, comments, etc. from the binary and store them in one SQLite database (which will be named by default 408cb1604d003f38715833a48485b6a4e620edf163fb59aef792595866e4796b.sqlite).
When Diaphora finishes exporting, close the database and open the next binary, c115d15807b96dcb9871ebc69618ef77473f1451c427e7349f9aa3c72891ddc2. As before, first let IDA perform the initial auto-analysis, run the script diaphora.py again and, this time, select in the 2nd field shown in the dialogue the previous SQLite database that we exported (remember that Diaphora works with SQLite databases, not directly with IDA databases) as shown in the picture below:
And, then, leave everything by default and press "OK". Diaphora will export the current binary and as soon as it finishes doing so it will start the diffing process. It will show a dialogue that will be updated from time to time telling us which heuristic is being executed:
After a while, Diaphora will finish finding matches and then it will show a set of choosers (IDA dockable list windows) showing the "Best", "Partial" and "Unmatched" functions:
In the "Best matches" tab we have all the functions that Diaphora matched and found no relevant change whatsoever. In the "Partial matches" tab we have all the functions that Diaphora matched but that were changed between the 2 binaries. There are also 2 other tabs: "Unmatched in primary" and "Unmatched in secondary". These tabs show the remaining functions from both binaries for which Diaphora found no appropriate match.
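The split into these four groups can be sketched as follows. This is a simplification under one assumption: a "best" match is treated as a similarity ratio of exactly 1.0 and anything lower is "partial". The function name and the tuple layout are invented for this example.

```python
def classify(matches, primary_names, secondary_names):
    """Split diff results into four groups.

    `matches` is a list of (primary_name, secondary_name, ratio) tuples.
    """
    matched_primary = {m[0] for m in matches}
    matched_secondary = {m[1] for m in matches}
    return {
        "best": [m for m in matches if m[2] == 1.0],
        "partial": [m for m in matches if m[2] < 1.0],
        "unmatched_primary": sorted(set(primary_names) - matched_primary),
        "unmatched_secondary": sorted(set(secondary_names) - matched_secondary),
    }
```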
We will focus on the "Partial matches" tab, which is the one that shows us what was changed between the 2 binaries. Let’s select the function GcAlloc::SetMark and then right click over it and select from the popup menu the option "Diff pseudo-code":
As we can see here, only a single character seems to have changed: instead of checking if GcContext::IsLegacyGCEnabled() returns true it now does the opposite. It seems that with this patch they are deprecating the "legacy garbage collector" feature. We can also diff the assembly if we want by right clicking on the selected function match and then selecting "Diff assembly" from the popup menu; it will show the following:
As expected, a conditional jump was changed (from jz to jnz). Now, let’s take a look at another function in the partial matches set, GcContext::InitIsLegacyGCEnabled(). This time, instead of choosing to diff assembly or pseudo-code we will select from the popup menu the option "Show pseudo-code patch":
As we can see, Microsoft changed some registry key to enable/disable the legacy garbage collector in the JScript engine. If we were interested in just how Microsoft mitigated or disabled this feature, we are done. The patch is a bit more complex than that and it involves how JScript variables are "scavenged", but it’s out of the scope of this blog post showing how to use Diaphora.
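To get a feeling for what a pseudo-code patch looks like, Python's standard difflib module can produce a unified diff of two listings. This is a minimal sketch: the helper name is hypothetical and the two listings used with it below are invented stand-ins, not the real jscript.dll pseudo-code.

```python
import difflib


def pseudo_code_patch(old_code, new_code, name):
    """Return a unified diff (patch-style text) between two pseudo-code listings."""
    lines = difflib.unified_diff(
        old_code.splitlines(),
        new_code.splitlines(),
        fromfile=name + " (primary)",
        tofile=name + " (secondary)",
        lineterm="",
    )
    return "\n".join(lines)


# Invented listings that only mimic the kind of one-character change seen above.
OLD = "if ( IsLegacyGCEnabled() )\n  do_work();\nreturn;"
NEW = "if ( !IsLegacyGCEnabled() )\n  do_work();\nreturn;"
print(pseudo_code_patch(OLD, NEW, "example_function"))
```

Lines starting with "-" exist only in the first listing and lines starting with "+" only in the second, which is the same convention used by patch files.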
Let’s see another common usage of binary diffing tools: importing symbols (function names, enums and structs) from Open Source libraries that were statically linked into some binary. For this example (and for legal reasons) we will use the following binaries:
sqldiff binary from Ubuntu that was statically linked with sqlite3.c.
We will start by compiling SQLite3: download the source amalgamation from their website, and simply compile it like this:
$ gcc -O2 shell.c sqlite3.c -g -o sqlite3
Then, open the resulting binary in IDA, let the auto-analysis finish and, when it’s done, run the script diaphora.py, leave all options by default and press OK. It will take some time because SQLite3 is a big project, even though it’s an embeddable engine. When Diaphora finishes exporting everything from the sqlite3 binary to the corresponding sqlite3.sqlite Diaphora database, close the binary and now load sqldiff.
As always, let IDA finish its initial work and, when it’s done, run diaphora.py again; in the 2nd field select the sqlite3.sqlite database that we exported before and just press "OK". After some time (5 minutes on my testing machine) it will finish exporting and diffing and will show the list of matches in chooser windows (or tabs, if you prefer).
In this example we have 237 functions that were matched by Diaphora with a similarity ratio of 1.0, which means that these functions were not changed. If we go to the partial matches we will see that we have almost 1,000 functions matched:
OK, so we have some (initial) good results. Let’s start importing matches so we can, later on, work on the sqldiff binary without having to reverse engineer whatever was in sqlite3.c: go to the "Best matches" tab, right click on the chooser and select from the popup menu the option "Import all data for sub_* functions" (this option will import everything that is in the sqlite3 binary for function matches starting with IDA’s auto-generated prefix "sub_"). When asked by Diaphora with the following dialogue just press "Yes":
Diaphora will start importing structs, enums, comments in the pseudo-code and assembly (if any), type libraries, etc… and after some time it will show something like the following:
We have some initial matches and local types to start working on. However, we can make it even better by importing some (or all) of the partial matches, so let’s go to this tab and select all results that have a similarity ratio of at least 0.600 (after a brief look at some matches, this is the value above which I have no doubt that they are reliable). Then right click on the chooser and select from the popup menu the option "Import selected sub_*":
When asked if we want to import press "Yes" and it will start importing everything related to the new functions (like global variables referenced by them, function comments, prototypes and names) and, after a while, it will finish doing so. And now, as you can see in the picture below, we will have many more functions renamed in our IDA database and we could start working already on this binary without having to waste our time reverse engineering the embedded SQLite3 database:
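The selection made in the last two steps (matches with a ratio of at least 0.600 whose local name still has the "sub_" prefix) boils down to a simple filter. The sketch below only illustrates that rule; the function name and the tuple layout are hypothetical and do not come from Diaphora.

```python
def select_for_import(matches, min_ratio=0.6, prefix="sub_"):
    """Keep matches that are reliable enough and still carry an auto-generated name.

    `matches` is a list of (local_name, imported_name, ratio) tuples.
    """
    return [
        (local, imported, ratio)
        for local, imported, ratio in matches
        if ratio >= min_ratio and local.startswith(prefix)
    ]
```

Skipping functions that do not start with the prefix avoids overwriting names that were already set by hand or recovered from symbols.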
In this guest post we have shown only the tip of the iceberg: just some of the most basic features of Diaphora. We excluded many other features, like scripting, automation, adding user-defined heuristics, etc., because otherwise the blog post would be too long. Hopefully, what we described will help you in your current and future reverse engineering projects. If you have any questions or doubts, you can contact the Diaphora author, Joxean Koret, by opening an issue on GitHub or sending an e-mail to admin @joxeakoret.com.
Happy reversing!
PS: The screenshots in this blog post were taken with a currently not yet published development version of Diaphora (what will be the 3.0 version of this project). However, basically everything but the number of function matches should be the same as one could get using the current public version of Diaphora.