This is a guest entry written by Joxean Koret from Activision. His views and opinions are his own and not those of Hex-Rays. Any technical or maintenance issues regarding the code herein should be directed to the author.
Diaphora: The most advanced Free and Open Source Binary Diffing Tool
Diaphora is an Open Source IDA plugin for doing binary diffing (usually called bindiffing, for short). In a nutshell, binary diffing is a reverse engineering technique used to find either the similarities or the differences between various pieces of software, in binary form. The technique was most likely invented by Thomas Dullien (Halvar Flake), author of the very first publicly available bindiffing tool called BinDiff.
I published this Open Source project, Diaphora, in 2015 and I have been testing and updating it for every single minor IDA version since then, which means from version 6.6 to the current 8.2 (as of the time of writing this blog post).
In this blog post I will discuss, briefly, how Diaphora works and, more in depth, show example usages. Let’s start…
This is how bindiffing works, in general: Two or more binaries are analysed and features about each function found in the binary are extracted. Then, the extracted functions are matched, using a set of heuristics, and compared, using some comparison function, to determine how close or different they are.
This is a brief list of some example features that can be extracted from functions:
- A "hash" for the CFG (Control Flow Graph).
- The literal constants (strings, non common numbers).
- The RVA (Relative Virtual Address) and size in bytes.
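To make the feature list above more concrete, here is a minimal sketch in Python. It is not Diaphora's actual code: the function dictionaries, the `cfg_shape_hash` and `extract_features` helpers and the degree-based hashing scheme are all assumptions chosen for illustration. The idea is that a CFG "hash" should depend on the shape of the graph and not on the addresses of its basic blocks.

```python
import hashlib

def cfg_shape_hash(edges):
    """Hash a control flow graph by its shape only, ignoring block addresses.

    Illustrative scheme (an assumption, not Diaphora's algorithm): hash the
    sorted list of (in-degree, out-degree) pairs of all basic blocks.
    """
    in_deg, out_deg, nodes = {}, {}, set()
    for src, dst in edges:
        nodes.update((src, dst))
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    # The degree pairs do not change when the code is relocated.
    signature = sorted((in_deg.get(n, 0), out_deg.get(n, 0)) for n in nodes)
    return hashlib.sha256(repr(signature).encode("ascii")).hexdigest()

def extract_features(func, image_base):
    """Collect the example features for one function (hypothetical layout)."""
    return {
        "cfg_hash": cfg_shape_hash(func["edges"]),
        "constants": sorted(func["constants"], key=repr),
        "rva": func["address"] - image_base,
        "size": func["size"],
    }

# Two made-up functions with the same diamond-shaped CFG at different addresses.
old_func = {
    "address": 0x401000, "size": 48,
    "edges": [(0x401000, 0x401010), (0x401000, 0x401020),
              (0x401010, 0x401030), (0x401020, 0x401030)],
    "constants": ["error: %s", 0xDEADBEEF],
}
new_func = {
    "address": 0x501000, "size": 48,
    "edges": [(0x501000, 0x501010), (0x501000, 0x501020),
              (0x501010, 0x501030), (0x501020, 0x501030)],
    "constants": ["error: %s", 0xDEADBEEF],
}
# A straight-line function has a different shape, so its hash differs.
linear_edges = [(1, 2), (2, 3), (3, 4)]

old_features = extract_features(old_func, image_base=0x400000)
new_features = extract_features(new_func, image_base=0x500000)
```

With these toy inputs both functions produce identical feature dictionaries even though they live at different addresses, which is the property a matching heuristic needs.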
Some example heuristics used by Diaphora to compare 2 functions can be the following:
- The functions are big enough and their MD5 hashes are the same.
- The functions are big enough and their flow graphs are the same.
- The functions are big enough and their pseudo-code is the same.
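The three example heuristics above, followed by a comparison function, could be sketched like this. Everything here is an assumption made for illustration: the dictionary layout, the `MIN_LINES` threshold for "big enough" (measured in pseudo-code lines) and the use of `difflib` as the comparison function are not taken from Diaphora's sources.

```python
import difflib

MIN_LINES = 3  # hypothetical threshold for "big enough"

def first_matching_heuristic(f1, f2):
    """Return the name of the first heuristic that matches, or None."""
    if min(len(f1["pseudocode"]), len(f2["pseudocode"])) < MIN_LINES:
        return None  # too small: tiny functions match each other by accident
    if f1["md5"] == f2["md5"]:
        return "same MD5"
    if f1["cfg_hash"] == f2["cfg_hash"]:
        return "same flow graph"
    if f1["pseudocode"] == f2["pseudocode"]:
        return "same pseudo-code"
    return None

def similarity(f1, f2):
    """Compare two matched functions: 1.0 means identical pseudo-code."""
    matcher = difflib.SequenceMatcher(None, f1["pseudocode"], f2["pseudocode"])
    return matcher.ratio()

# Made-up functions: same graph shape, one pseudo-code line changed.
func_a = {
    "md5": "md5-of-a", "cfg_hash": "diamond",
    "pseudocode": ["v1 = get();", "if ( v1 )", "  use(v1);", "return 0;"],
}
func_b = {
    "md5": "md5-of-b", "cfg_hash": "diamond",
    "pseudocode": ["v1 = get();", "if ( !v1 )", "  use(v1);", "return 0;"],
}
tiny = {"md5": "md5-of-a", "cfg_hash": "diamond", "pseudocode": ["return 0;"]}

reason = first_matching_heuristic(func_a, func_b)
ratio = similarity(func_a, func_b)
```

The two steps are kept separate on purpose: a heuristic only proposes a candidate pair, and the comparison function then decides how close the pair really is.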
The tool’s GUI
When we execute the diaphora.py script it shows the following dialogue:
Here we can select the SQLite database to which the current IDA database will be exported, a previously exported secondary SQLite database to diff against, and the memory address range that limits what is exported. We can also enable or disable many options: whether to use the decompiler, what to export, which heuristics to use, what to exclude, etc.
How Diaphora works
Diaphora, as pretty much any other binary diffing tool, works this way:
- First, we export the databases (the binaries) that we want to compare.
- Then, we diff both generated databases to find matches between them.
- Optionally, we can import matches from one binary to another.
In short, we have to export our binaries, then diff the binaries and, optionally, import everything from one database to the other.
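The export and diff steps can be illustrated with plain SQLite, since the exported databases are SQLite files. The sketch below is only an assumption about how such a workflow could look: the `functions` table, its columns and the single "same hash" join are hypothetical and far simpler than a real export.

```python
import os
import sqlite3
import tempfile

def export_functions(path, functions):
    """Step 1: write one binary's function features to its own SQLite file."""
    db = sqlite3.connect(path)
    try:
        db.execute(
            "CREATE TABLE functions (name TEXT, address INTEGER, bytes_hash TEXT)"
        )
        db.executemany("INSERT INTO functions VALUES (?, ?, ?)", functions)
        db.commit()
    finally:
        db.close()

def diff_databases(primary_path, secondary_path):
    """Step 2: attach the second database and match functions with a query."""
    db = sqlite3.connect(primary_path)
    try:
        db.execute("ATTACH DATABASE ? AS diff", (secondary_path,))
        return db.execute(
            "SELECT p.name, s.name FROM main.functions p "
            "JOIN diff.functions s ON p.bytes_hash = s.bytes_hash "
            "ORDER BY p.address"
        ).fetchall()
    finally:
        db.close()

def demo():
    """Export two made-up binaries and diff them."""
    with tempfile.TemporaryDirectory() as tmp:
        old_db = os.path.join(tmp, "old.sqlite")
        new_db = os.path.join(tmp, "new.sqlite")
        export_functions(old_db, [("sub_401000", 0x401000, "aa"),
                                  ("sub_401100", 0x401100, "bb")])
        export_functions(new_db, [("parse_header", 0x501000, "aa"),
                                  ("other", 0x501200, "cc")])
        return diff_databases(old_db, new_db)

matches = demo()
```

Because each binary is exported only once, the same exported file can later be diffed against any number of other databases without opening the original binary again.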
Binary diffing use cases
The most common binary diffing use cases (i.e., why reverse engineers use bindiffing tools in our day-to-day job) are the following ones:
- Patch diffing. A binary or set of binaries has been patched (for example, to fix some vulnerability) and reverse engineers diff both binaries in order to see which changes were made between the old, unpatched version and the latest version.
- Porting work. A reverse engineer works on some binary version for some time and, later on, the vendor publishes a new version of the binary. The reverse engineer diffs the old and new versions to see what has been modified, added, removed, etc… and also to port their work: comments, function names, enums, structs, etc…
- Importing symbols from static libraries. A reverse engineer starts their work with some binary and notices it uses some specific Open Source library that is statically linked in the binary. The reverse engineer compiles or downloads a binary version of the Open Source library with full symbols, diffs it against the binary that embeds this library and then imports the matches. This way they don’t need to waste their time reverse engineering Open Source software, and they also get enums, structs, function names and prototypes imported into the binary.
Let’s see examples with Diaphora of some of the previously mentioned potential use-cases:
- Patch diffing. We will diff CVE-2020-0674, a vulnerability patched in Microsoft’s JScript engine.
- Importing symbols from static libraries. We will compile the source code of one version of the SQLite3 engine, diff it against a binary embedding it (sqldiff from some Ubuntu version) and then import the matches.
Patch Diffing CVE-2020-0674
According to Mitre, CVE-2020-0674 was a remote code execution vulnerability in the way that the scripting engine handled objects in memory in Internet Explorer, aka ‘Scripting Engine Memory Corruption Vulnerability’. We will work with 2 jscript.dll binaries with the following SHA256 hashes:
Now, let’s load the first binary (408cb1604d003f38715833a48485b6a4e620edf163fb59aef792595866e4796b) in IDA, let the auto-analysis finish and then run diaphora.py from within IDA, leave all options at their defaults and click "OK". Diaphora will start to export all functions, structs, enums, comments, etc… from the binary and store them in one SQLite database (which will be named by default
When Diaphora finishes exporting, close the database and open the next binary, c115d15807b96dcb9871ebc69618ef77473f1451c427e7349f9aa3c72891ddc2. As before, first let IDA perform the initial auto-analysis, then run the diaphora.py script again and, this time, select in the 2nd field shown in the dialogue the SQLite database that we exported previously (remember that Diaphora works with SQLite databases, not directly with IDA databases), as shown in the picture below:
And, then, leave everything by default and press "OK". Diaphora will export the current binary and as soon as it finishes doing so it will start the diffing process. It will show a dialogue that will be updated from time to time telling us which heuristic is being executed:
After a while, Diaphora will finish finding matches and then it will show a set of choosers (IDA dockable list windows) showing the "Best", "Partial" and "Unmatched" functions:
In the "Best matches" tab we have all the functions that Diaphora matched and in which it found no relevant change whatsoever. In the "Partial matches" tab we have all the functions that Diaphora matched but that were changed between the 2 binaries. There are also 2 other tabs: "Unmatched in primary" and "Unmatched in secondary". These tabs show the remaining functions from each binary for which Diaphora found no appropriate match.
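The way results are split across these four tabs can be sketched as follows. This is a simplified assumption, not Diaphora's real logic: it treats a similarity ratio of exactly 1.0 as a "best" match and anything lower as a "partial" match, and all function names and ratios are made up.

```python
def classify(matches, primary_names, secondary_names):
    """Split (primary, secondary, ratio) matches into the four result tabs."""
    matched_primary = {m[0] for m in matches}
    matched_secondary = {m[1] for m in matches}
    return {
        "Best matches": [m for m in matches if m[2] == 1.0],
        "Partial matches": [m for m in matches if m[2] < 1.0],
        "Unmatched in primary": sorted(set(primary_names) - matched_primary),
        "Unmatched in secondary": sorted(set(secondary_names) - matched_secondary),
    }

# Hypothetical diff results between an old and a new binary.
primary = ["init", "parse", "cleanup", "legacy_path"]
secondary = ["init", "parse", "cleanup", "new_check"]
example_matches = [
    ("init", "init", 1.0),
    ("parse", "parse", 0.93),
    ("cleanup", "cleanup", 1.0),
]
tabs = classify(example_matches, primary, secondary)
```

For patch diffing, the interesting rows are the ones that land in "Partial matches" and in the two "Unmatched" lists, because they are the only places where the binaries differ.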
We will focus on the "Partial matches" tab, which is the one that shows us what was changed between the 2 binaries. Let’s select the function GcAlloc::SetMark and then right click over it and select from the popup menu the option "Diff pseudo-code":
As we can see here, only a single character seems to have changed: instead of checking if GcContext::IsLegacyGCEnabled() returns true it now does the opposite. It seems that with this patch they are deprecating the "legacy garbage collector" feature. We can also diff the assembly if we want by right-clicking the selected function match and then selecting "Diff assembly" from the popup menu; it will show the following:
As expected, a conditional jump was changed (from jnz). Now, let’s take a look at another function in the partial matches set, GcContext::InitIsLegacyGCEnabled(). This time, instead of choosing to diff assembly or pseudo-code we will select from the popup menu the option "Show pseudo-code patch":
As we can see, Microsoft changed some registry key to enable/disable the legacy garbage collector in the JScript engine. If we were interested in just how Microsoft mitigated or disabled this feature, we are done. The patch is a bit more complex than that and it involves how JScript variables are "scavenged", but that is out of the scope of this blog post, which only shows how to use Diaphora.
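A pseudo-code patch of this kind is essentially a unified diff between the two decompilations. The sketch below produces one with Python's standard `difflib` module. The pseudo-code lines are invented for illustration and are not the real jscript.dll decompilation; `mark_object` and the file labels are hypothetical names, and this does not claim to be how Diaphora generates its own patch view.

```python
import difflib

def pseudocode_patch(old_lines, new_lines, old_name="old", new_name="new"):
    """Return a unified diff between two lists of pseudo-code lines."""
    return list(difflib.unified_diff(
        old_lines, new_lines, fromfile=old_name, tofile=new_name, lineterm=""
    ))

# Invented pseudo-code showing an inverted check, as in the SetMark example.
unpatched = [
    "if ( GcContext::IsLegacyGCEnabled() )",
    "  mark_object(obj);",
]
patched = [
    "if ( !GcContext::IsLegacyGCEnabled() )",
    "  mark_object(obj);",
]

patch = pseudocode_patch(unpatched, patched, "unpatched", "patched")
print("\n".join(patch))
```

Lines starting with `-` exist only in the unpatched version and lines starting with `+` only in the patched one, so a one-character change like an inverted condition shows up as a single removed and added pair.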
Importing symbols from static libraries
Let’s see another common usage of binary diffing tools: importing symbols (function names, enums and structs) from Open Source libraries that were statically linked into some binary. For this example (and for legal reasons) we will use the following binaries:
- Our own compiled version of the SQLite3 engine with symbols.
- sqldiff binary from Ubuntu that was statically linked with
We will start by compiling SQLite3: download the source amalgamation from their website, and simply compile it like this:
$ gcc -O2 shell.c sqlite3.c -g -o sqlite3
Then, open the resulting binary in IDA, let the auto-analysis finish and when it’s done run the diaphora.py script, leave all options at their defaults and press "OK". It will take some time because SQLite3 is a big project, even though it’s an embeddable engine. When Diaphora finishes exporting everything from the sqlite3 binary to the corresponding sqlite3.sqlite Diaphora database, close the binary and now load
As always, let IDA finish its initial work and when it’s done run diaphora.py again, select in the 2nd field the sqlite3.sqlite database that we exported before and just press "OK". After some time (5 minutes on my testing machine) it will finish exporting and diffing and will show the list of matches in chooser windows (or tabs, if you prefer).
In this example we have 237 functions that were matched by Diaphora with a similarity ratio of 1.0, which means that these functions were not changed. If we go to the partial matches we will see that we have almost 1,000 functions matched:
OK, so we have some good initial results. Let’s start importing matches so we can, later on, work on the sqldiff binary without having to reverse engineer whatever was in sqlite3.c: go to the "Best matches" tab, right click on the chooser and select from the popup menu the option "Import all data for sub_* functions" (this option will import everything that is in the sqlite3 binary for function matches starting with IDA’s auto-generated prefix "sub_"). When asked by Diaphora with the following dialogue just press "Yes":
Diaphora will start importing structs, enums, comments in the pseudo-code and assembly (if any), type libraries, etc… and after some time it will show something like the following:
We have some initial matches and local types to start working on. However, we can make it even better by importing some (or all) of the partial matches, so let’s go to this tab and select all results that have, at least, a similarity ratio of 0.600 (after taking a brief look at some matches, this is the value above which I have no doubt that they are reliable) and then right click on the chooser and select from the popup menu the option "Import selected sub_*":
When asked if we want to import press "Yes" and it will start importing everything related to the new functions (like global variables referenced by them, function comments, prototypes and names) and, after a while, it will finish doing so. And now, as you can see in the picture below, we will have many more functions renamed in our IDA database and we could start working already on this binary without having to waste our time reverse engineering the embedded SQLite3 database:
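The selection rule used for this import (only functions still carrying the auto-generated "sub_" prefix, and only matches with a similarity ratio of at least 0.600) can be written down as a small filter. This is a sketch of the rule only, not of Diaphora's import code: the match tuples, addresses, pairings and ratios below are made up, and a real import also brings over prototypes, comments and types.

```python
MIN_RATIO = 0.6  # the threshold chosen in this post for reliable partial matches

def select_renames(matches, min_ratio=MIN_RATIO):
    """Map unnamed local functions to the library names worth importing."""
    renames = {}
    for local_name, library_name, ratio in matches:
        # Never overwrite a name the analyst (or a symbol) already provided.
        if local_name.startswith("sub_") and ratio >= min_ratio:
            renames[local_name] = library_name
    return renames

# Hypothetical partial matches: (name in sqldiff, name in sqlite3, ratio).
partial_matches = [
    ("sub_1A2B0", "sqlite3_open", 0.98),
    ("sub_1C400", "sqlite3VdbeExec", 0.74),
    ("sub_20F10", "sqlite3_close", 0.41),
    ("main", "main", 0.90),
]
renames = select_renames(partial_matches)
```

Keeping the threshold as a parameter makes it easy to start strict, inspect a few of the imported names by hand and then lower it if the lower-ratio matches still look reliable.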
In this guest post we have shown only the tip of the iceberg, just some of the most basic features of Diaphora. We excluded many other features, like scripting, automation, adding user-defined heuristics, etc… because otherwise the blog post would be too big. Hopefully, what we described will help you in your current and future reverse engineering projects. If you have any questions or doubts, you can contact the Diaphora author, Joxean Koret, by opening an issue on GitHub or sending an e-mail to admin @joxeakoret.com.
PS: The screenshots in this blog post were taken with a development version of Diaphora that has not been published yet (it will become version 3.0 of this project). However, basically everything but the number of function matches should be the same as what one would get using the current public version of Diaphora.