## About the Tool
<em>Mnemocrypt</em> is a random forest classifier-based tool for the detection and partial identification of cryptographic functions in x86 executables. The machine learning model bases its predictions on general metrics related to the structure of functions, as well as on statistics related to their content with different levels of granularity. These statistics are essentially derived from the mnemonics of assembly instructions. Mnemocrypt can be considered a generalization of Caballero heuristics-based approaches and incorporates some of their principles. The Mnemocrypt IDA plugin provides partial cryptographic identification information (via the <em>Identification Tag</em> column) by leveraging mnemonics belonging to specific cryptographic instruction extension sets. The tool has been tested on IDA Pro 9.0 with Python 3.9.2.

## How to use the Mnemocrypt plugin
- Move `mnemocrypt.py`, `mnemocrypt_trained.pkl` and `mnemocrypt_roots.json` to your IDA plugins directory.
- Run the plugin on the executables of your choice (press Ctrl-Shift-M or select "Mnemocrypt" in the list of plugins).

## Training Set
The training set is composed of 32-bit binaries statically linked with OpenSSL and Libsodium, compiled with debugging symbols (to enable the reading of function names) using Clang, GCC, and MSVC compilers across all their main optimization levels: O0, O1, O2, O3, Ofast, Os, and Oz for Clang; O0, O1, O2, O3, Ofast, and Os for GCC; and Od, O1, O2, and Ox for MSVC. This variety allows the model to learn variations in mnemonics for the same functions compiled with different configurations. The functions in this set of binaries have been manually labeled as cryptographic or non-cryptographic, with the guiding principle of minimizing potential false positives in complex cases.

## Pre-trained Model
The model `mnemocrypt_trained.pkl` has been trained with 1000 decision trees and default depth, using SMOTE (Synthetic Minority Oversampling Technique) to address the imbalanced nature of the training data (as there are significantly fewer cryptographic functions than non-cryptographic ones).

## Features Used by Mnemocrypt
The tool bases its predictions on statistics over semantics-based categories of mnemonics, as well as on the roots of mnemonics used in Caballero heuristics. A root is the most semantically meaningful part of a mnemonic, since many mnemonic variants serve very similar purposes. Other cryptography-relevant features, such as the number of data references or function calls, are also considered. Users can list all features along with their respective weights through the trained model to better understand the basis of Mnemocrypt's decisions. Information about the categories and their respective roots and variants is stored in `mnemocrypt_roots.json` and is used during feature computation.
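
As a rough illustration of how category and root statistics could be derived from a function's mnemonics, here is a minimal sketch. The JSON layout (category, then root, then list of variants), the feature names, and the helpers `build_lookup` and `mnemonic_features` are assumptions made for this example. They are not taken from `mnemocrypt.py`, and the real schema of `mnemocrypt_roots.json` may differ.

```python
import json
from collections import Counter

# Hypothetical layout: category -> root -> list of mnemonic variants.
# The actual schema of mnemocrypt_roots.json may be different.
ROOTS_JSON = """
{
  "bitwise": {"xor": ["xor", "pxor", "vpxor"],
              "rol": ["rol", "ror"],
              "shl": ["shl", "shr", "sar"]},
  "arithmetic": {"add": ["add", "adc", "paddd"],
                 "mul": ["mul", "imul"]},
  "data_transfer": {"mov": ["mov", "movzx", "movdqa"]}
}
"""


def build_lookup(categories):
    """Invert category -> root -> variants into variant -> (category, root)."""
    lookup = {}
    for category, roots in categories.items():
        for root, variants in roots.items():
            for variant in variants:
                lookup[variant] = (category, root)
    return lookup


def mnemonic_features(mnemonics, lookup):
    """Return relative frequencies of categories and roots for one function."""
    total = len(mnemonics)
    category_counts = Counter()
    root_counts = Counter()
    for mnemonic in mnemonics:
        hit = lookup.get(mnemonic.lower())
        if hit is None:
            continue  # mnemonic not covered by the roots table
        category, root = hit
        category_counts[category] += 1
        root_counts[root] += 1
    features = {}
    for category, count in category_counts.items():
        features[f"cat_{category}"] = count / total
    for root, count in root_counts.items():
        features[f"root_{root}"] = count / total
    return features


LOOKUP = build_lookup(json.loads(ROOTS_JSON))

# Mnemonics of a made-up function, in the order they would be disassembled.
EXAMPLE = ["mov", "pxor", "rol", "xor", "add", "nop"]
FEATURES = mnemonic_features(EXAMPLE, LOOKUP)
```

Normalizing by the total instruction count is one possible choice that keeps the values comparable between short and long functions. The features actually used by the model can be listed from the trained model, as described above.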

## Example output of the Mnemocrypt plugin in the IDA GUI
![image info](example_output.png)

- Coloring convention:
  - yellow: confidence score 0.5-0.75
  - orange: confidence score 0.75-0.95
  - red: confidence score 0.95-1.0

- The higher the confidence score, the more likely it is, according to Mnemocrypt, that a given function performs cryptographic operations.

- The minimum confidence score required for a function to be shown, as well as the coloring convention, can be changed in the plugin script <em>mnemocrypt.py</em>.

- The most frequent kinds of false positives with a high confidence score (greater than 0.9) are compression- or encoding-related functions, as well as functions that perform complex computations or data processing unrelated to cryptography.
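
The coloring convention and the minimum score listed above can be pictured as a small lookup, sketched below. The names `MIN_SCORE`, `COLOR_BANDS` and `color_for_score` are hypothetical and do not come from `mnemocrypt.py`. The sketch also assumes that the lower bound of each band is inclusive, which the convention above does not specify.

```python
# Hypothetical names; thresholds follow the coloring convention of this README.
MIN_SCORE = 0.5

# (inclusive lower bound, color), ordered from the highest band to the lowest.
COLOR_BANDS = [
    (0.95, "red"),
    (0.75, "orange"),
    (0.5, "yellow"),
]


def color_for_score(score, min_score=MIN_SCORE):
    """Return the band color for a confidence score, or None if it is hidden."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence score must lie in [0, 1]")
    if score < min_score:
        return None  # below the display threshold: function is not listed
    for lower_bound, color in COLOR_BANDS:
        if score >= lower_bound:
            return color
    return None  # min_score was set below the lowest band
```

For instance, `color_for_score(0.8)` falls in the orange band, while raising `min_score` to 0.9 would hide that same function.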

## Notes
- Mnemocrypt can take considerable time to process large files (its running time is roughly linear in the size of the processed binary).
- In the IDA output window, the message announcing the end of analysis sometimes appears after the one indicating that Mnemocrypt has finished. Mnemocrypt nevertheless calls `idaapi.auto_wait()` to ensure that autoanalysis is complete before it starts.
