State-of-the-art binary code analysis tools

We’ve mentioned operand representation before but today we’ll use a specific one to find the Easter egg hidden in the post #85.

More specifically, it was this screenshot:

The function surprise calls printf, but the arguments being passed to it seem to all be numbers. Doesn’t printf() usually work with strings? What’s going on?

Numbers and characters

As you probably know, computers do not actually distinguish numbers from characters – to them they’re all just a set of bits. So it’s all a matter of interpretation or representation. For example, all of the following are represented by the same bit pattern:

  1. 65 (decimal number)
  2. 0x41, 41h, H'41 (hexadecimal number)
  3. 0101 or 101o (octal number)
  4. 1000001b or 0b1000001 (binary number)
  5. 'A' (ASCII character)
  6. WM_COMPACTING (Win32 API constant)
  7. (and many other variations)

Character operand representation

In fact, listing in the screenshot has been modified from the defaults to make the Easter egg less obvious. Here’s the original version as text:

.text:00401010 ; int surprise(...)
.text:00401010 _surprise proc near                     ; CODE XREF: _main↑p
.text:00401010
.text:00401010 var_24= dword ptr -24h
.text:00401010 var_20= dword ptr -20h
.text:00401010 _Format= byte ptr -1Ch
.text:00401010 var_18= dword ptr -18h
.text:00401010 var_14= dword ptr -14h
.text:00401010 var_10= dword ptr -10h
.text:00401010 var_C= dword ptr -0Ch
.text:00401010 var_8= dword ptr -8
.text:00401010 var_4= dword ptr -4
.text:00401010
.text:00401010 sub     esp, 24h
.text:00401013 mov     eax, ___security_cookie
.text:00401018 xor     eax, esp
.text:0040101A mov     [esp+24h+var_4], eax
.text:0040101E lea     eax, [esp+24h+var_24]
.text:00401021 mov     dword ptr [esp+24h+_Format], 70747468h
.text:00401029 push    eax
.text:0040102A lea     eax, [esp+28h+_Format]
.text:0040102E mov     [esp+28h+var_18], 2F2F3A73h
.text:00401036 push    eax                             ; _Format
.text:00401037 mov     [esp+2Ch+var_14], 2D786568h
.text:0040103F mov     [esp+2Ch+var_10], 73796172h
.text:00401047 mov     [esp+2Ch+var_C], 6D6F632Eh
.text:0040104F mov     [esp+2Ch+var_8], 73252Fh
.text:00401057 mov     [esp+2Ch+var_24], 74736165h
.text:0040105F mov     [esp+2Ch+var_20], 7265h
.text:00401067 call    _printf
.text:0040106C mov     ecx, [esp+2Ch+var_4]
.text:00401070 add     esp, 8
.text:00401073 xor     ecx, esp                        ; StackCookie
.text:00401075 xor     eax, eax
.text:00401077 call    @__security_check_cookie@4      ; __security_check_cookie(x)
.text:0040107C add     esp, 24h
.text:0040107F retn
.text:0040107F _surprise endp

In hexadecimal it’s almost immediately obvious: the “numbers” are actually short fragments of ASCII text. The code is building strings on the stack piece by piece. This can be made more explicit by converting numbers to the character operand type (shortcut R).

To help you decide whether such operand type makes sense, IDA shows a preview in the context menu:

This way it’s pretty clear that the “number” is actually a text fragment. After converting all “numbers” to character constant, a pattern begins to emerge:

Due to the little-endian memory organization of the x86 processor family, the individual fragments have to be read backwards (i.e. character literal 'ptth' corresponds to the string fragment "http").

Decompiler and optimized string operations

Now it’s almost obvious what the result is supposed to be but there’s in fact an even easier way to discover it.

Because the approach of processing short strings in register-sized chunks is often used by compilers to implement common C runtime functions inline instead of calling the library function, the decompiler uses heuristics to detect such code patterns and show them as equivalent function calls again. If we decompile this function, the decompiler reassembles the strings and shows them as if they were like that in the pseudocode:

Stack strings

Malware often uses a similar approach of building strings by small pieces (most often character by character) on the stack because this way the complete string does not appear in the binary and can’t be discovered by simply searching for it. Thanks to the automatic comments shown by IDA for operands not having explicitly assigned type, they are usually obvious in the disassembly:

And the decompiler easily recovers the complete string:

void __noreturn start()
{
  char v0[36]; // [esp+0h] [ebp-28h] BYREF  
  qmemcpy(v0, "FLAG{STACK-STRINGS-ARE-BEST-STRINGS}", sizeof(v0));
  [...]
 }

P.S. If you want to play with the Easter egg binary and reproduce the results in this post, download it here:easter2022.zip