
One of the new features we are preparing for the next version of IDA is the ability to write processor modules using your favorite scripting language.
After realizing how handy it is to write file loaders using scripting languages, we set out to do the same for processor modules. As an exercise for this new feature, we implemented a processor module for the EFI bytecode.


Background

In IDA Pro, a processor module implementation is usually split into four parts:

  1. Processor, assembler, instruction and register definitions (ins.cpp/.hpp, reg.cpp)
  2. Decoder (ana.cpp): decodes an instruction into an insn_t structure (the ‘cmd’ global variable)
  3. Emulation (emu.cpp): emulates instructions, creates appropriate cross references, traces the stack, recognizes code patterns, etc…
  4. Output (out.cpp): outputs the result to the screen

The processor module is described using the processor_t structure. It holds pointers to the register and instruction tables, the processor module name, and the callbacks (ana, emu, out, notify, …).
The assembler is described using the asm_t structure. It holds pointers to the assembler syntax and other callbacks.
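
To make the split concrete, here is a minimal, self-contained sketch of the decode, emulate and output stages for a made-up two-instruction bytecode. Nothing in it is IDA API: the `Insn` class, the opcode table, and the functions `ana`, `emu`, `out` and `disassemble` are hypothetical stand-ins named after the files listed above.

```python
# Toy illustration of the ana/emu/out split. Everything here is hypothetical:
# the opcodes, the Insn class and the function names are not part of IDA.
from dataclasses import dataclass
from typing import List, Optional

# Instruction definitions: opcode -> (mnemonic, has_operand, stops_flow)
INSTRUCTIONS = {
    0x01: ("PUSH", True, False),
    0x02: ("RET", False, True),
}

@dataclass
class Insn:
    ea: int                        # address of the instruction
    mnem: str = ""
    operand: Optional[int] = None  # single immediate operand, if any
    size: int = 0                  # decoded size in bytes
    stops_flow: bool = False       # True if execution does not fall through

def ana(code: bytes, ea: int) -> Optional[Insn]:
    """Decoder: turn the bytes at 'ea' into an Insn, or None on failure."""
    if ea >= len(code) or code[ea] not in INSTRUCTIONS:
        return None
    mnem, has_operand, stops_flow = INSTRUCTIONS[code[ea]]
    insn = Insn(ea=ea, mnem=mnem, size=1, stops_flow=stops_flow)
    if has_operand:
        if ea + 1 >= len(code):
            return None  # truncated instruction
        insn.operand = code[ea + 1]
        insn.size = 2
    return insn

def emu(insn: Insn) -> List[int]:
    """Emulator: return the addresses that execution can flow to next."""
    return [] if insn.stops_flow else [insn.ea + insn.size]

def out(insn: Insn) -> str:
    """Output: render the instruction as text without changing any state."""
    if insn.operand is None:
        return insn.mnem
    return "%s %d" % (insn.mnem, insn.operand)

def disassemble(code: bytes) -> List[str]:
    """Drive the three stages, starting at address 0 and following flow."""
    lines, queue, seen = [], [0], set()
    while queue:
        ea = queue.pop()
        if ea in seen:
            continue
        seen.add(ea)
        insn = ana(code, ea)
        if insn is None:
            continue
        lines.append(out(insn))
        queue.extend(emu(insn))
    return lines

print(disassemble(bytes([0x01, 0x05, 0x02])))  # ['PUSH 5', 'RET']
```

The point of the sketch is only the division of labour: the decoder knows about bytes, the emulator knows about control flow, and the output stage only formats text.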

Writing a processor module in Python

To write a processor module in Python, we follow similar logic.

  1. Write the get_idp_desc() function. It simply tells IDA what processors the module can handle.
    def get_idp_desc():
        return "EFI Byte code:ebc"
    

    The return value means that this processor is named “EFI Byte code” and its short name is “ebc”. Thus a subsequent call to set_processor_type(‘ebc’) from a file loader will succeed.

    In the case of the pc processor module, which handles many variants of the x86 architecture, the string looks like this:

    Intel 80x86 processors:8086:80286r:80286p:80386r:80386p:...
  2. Define the registers and instructions:
    # Register definitions
    proc_Registers = [
        # General purpose registers
        "R0",
        "R1",
        ...,
        "R10",
        ...
    ]
    # Instruction definitions
    proc_Instructions = [
        {'name': 'INSN1', 'feature': CF_USE1},
        {'name': 'INSN2', 'feature': CF_USE1 | CF_CHG1},
        ...
    ]
    
  3. Write the get_idp_def() function. It should return a dictionary similar to the processor_t structure, containing the processor, assembler, instruction and register definitions.
    # This function returns the processor module definition
    def get_idp_def():
        return {
            'version': IDP_INTERFACE_VERSION,
            # IDP id
            'id' : 0x8000 + 1,
            # Processor features
            'flag' : PR_USE32 | PRN_HEX | PR_RNAMESOK,
            # short processor names
            # Each name should be shorter than 9 characters
            'psnames': ['ebc'],
            # long processor names
            # No restriction on name length.
            'plnames': ['EFI Byte code'],
            # number of registers
            'regsNum': len(proc_Registers),
            # register names
            'regNames': proc_Registers,
            # Array of instructions
            'instruc': proc_Instructions,
            ....
            'assembler': \
            {
                    # flag
                    'flag' : ASH_HEXF3 | AS_UNEQU | AS_COLON | ASB_BINF4 | AS_N2CHR,
                    # Assembler name (displayed in menus)
                    'name': "EFI bytecode assembler",
                    ...
                    # byte directive
                    'a_byte': "db",
                    # word directive
                    'a_word': "dw",
                    # remove if not allowed
                    'a_dword': "dd",
                    ...
            } # Assembler
        }
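
As a quick sanity check on the naming rules above, the following sketch splits a `get_idp_desc()`-style string into its long name and short names, and rejects any short name of nine or more characters. The helper `parse_idp_desc` is hypothetical and runs outside IDA; it assumes the colon-separated layout shown in the two example strings.

```python
def parse_idp_desc(desc):
    """Split 'Long name:short1:short2' into (long_name, [short names]).

    Hypothetical helper for illustration only; not part of IDA or IDAPython.
    """
    long_name, *short_names = desc.split(":")
    if not short_names:
        raise ValueError("description must contain at least one short name")
    # The 'psnames' comment above asks for names shorter than 9 characters.
    too_long = [name for name in short_names if len(name) >= 9]
    if too_long:
        raise ValueError("short names must be shorter than 9 characters: %r" % too_long)
    return long_name, short_names

print(parse_idp_desc("EFI Byte code:ebc"))  # ('EFI Byte code', ['ebc'])
```

Running the same check against the `psnames` and `plnames` entries of the dictionary is a cheap way to catch a mismatch between the description string and the definition before loading the module.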
    

Now that all the declarations are in place, we can implement the decoder (or analyzer), emulator and output callbacks.

  • The analyzer looks like this:
    def ph_ana():
        """
        Decodes an instruction into the global variable 'cmd'
        Current address is pre-filled in cmd.ea
        """
        cmd = idaapi.cmd
        # take opcode byte
        b = ua_next_byte()
        # decode and fill cmd.Operands etc...
        # ...
        # Return decoded instruction size or zero
        return cmd.size
    

    Decoding a single instruction and filling in the ‘cmd’ variable may look like this:

    def decode_JMP8(opbyte, cmd):
        conditional   = (opbyte & 0x80) != 0
        cs            = (opbyte & 0x40) != 0
        cmd.Op1.type  = o_near
        cmd.Op1.dtyp  = dt_byte
        addr          = ua_next_byte()
        cmd.Op1.addr  = (as_signed(addr, 8) * 2) + cmd.size + cmd.ea
        if conditional:
            cmd.auxpref = FL_CS if cs else FL_NCS
        return True
    
  • The emulator:
    # Emulate instruction, create cross-references, plan to analyze
    # subsequent instructions, modify flags etc. Upon entrance to this function
    # all information about the instruction is in 'cmd' structure.
    # If zero is returned, the kernel will delete the instruction.
    def ph_emu():
        # Fetch the global instruction, as ph_ana() and ph_out() do
        cmd = idaapi.cmd
        aux = cmd.auxpref
        Feature = cmd.get_canon_feature()
        if Feature & CF_USE1:
            handle_operand(cmd.Op1, 1)
        if Feature & CF_CHG1:
            handle_operand(cmd.Op1, 0)
        if Feature & CF_USE2:
            handle_operand(cmd.Op2, 1)
        if Feature & CF_CHG2:
            handle_operand(cmd.Op2, 0)
        if Feature & CF_JUMP:
            QueueMark(Q_jumps, cmd.ea)
        # add flow xref unless the instruction stops the execution flow
        if (Feature & CF_STOP) == 0:
            ua_add_cref(0, cmd.ea + cmd.size, fl_F)
        return 1
    
  • The output callback:
    # Generate text representation of an instruction in 'cmd' structure.
    # This function shouldn't change the database, flags or anything else.
    # All these actions should be performed only by ph_emu() function.
    def ph_out():
        cmd = idaapi.cmd
        # Init output buffer
        buf = idaapi.init_output_buffer(1024)
        # First, output the instruction mnemonic
        OutMnem()
        # Output the first operand if present (this invokes the ph_outop callback)
        out_one_operand( 0 )
        # Output the rest of the operands
        for i in xrange(1, 3):
            op = cmd[i]
            if op.type == o_void:
                break
            out_symbol(',')
            OutChar(' ')
            out_one_operand(i)
        # Terminate the output buffer
        term_output_buffer()
        # Emit the line
        cvar.gl_comm = 1
        MakeLine(buf)
    

    Note that the previous callbacks are very similar to their C language counterparts.
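
The address arithmetic in `decode_JMP8` can be checked outside IDA. The sketch below assumes that `as_signed(value, nbits)` sign-extends an `nbits`-wide value, and reimplements that behaviour in a hypothetical helper, `sign_extend`. The names `jmp8_flags` and `jmp8_target` are also made up for this illustration. `jmp8_target` reproduces the expression `(as_signed(addr, 8) * 2) + cmd.size + cmd.ea`, where `size` is the number of bytes consumed so far (two for the opcode and displacement bytes).

```python
def sign_extend(value, nbits):
    """Interpret the low 'nbits' bits of 'value' as a two's-complement number.

    Assumed to match what as_signed() does in the decoder above.
    """
    value &= (1 << nbits) - 1
    sign_bit = 1 << (nbits - 1)
    return value - (1 << nbits) if value & sign_bit else value

def jmp8_flags(opbyte):
    """Return the (conditional, cs) bits tested at the top of decode_JMP8."""
    return (opbyte & 0x80) != 0, (opbyte & 0x40) != 0

def jmp8_target(ea, size, disp_byte):
    """Jump target: instruction end plus the signed displacement times two."""
    return ea + size + sign_extend(disp_byte, 8) * 2

# A displacement of 0x02 jumps forward, 0xFE (that is, -2) jumps backward.
print(hex(jmp8_target(0x1000, 2, 0x02)))  # 0x1006
print(hex(jmp8_target(0x1000, 2, 0xFE)))  # 0xffe
```

Because the displacement is multiplied by two, a single signed byte covers targets from 256 bytes before to 254 bytes after the end of the instruction.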

Although this feature will not work with the current version of IDA Pro, you can download the EBC script sample for a preview of how a module would look.

If you like this feature, make sure to apply for beta testing of the next version when we announce it!