Apple2TC: an Apple II Binary to C Decompiler

Apple2TC: an Apple II Binary to C Decompiler - Part 6, The IR

This is part 6 in a series of posts describing Apple2TC.

The start is in Part 1.

Part 5 described getting two Apple II games to work when decompiled in "Simple C" mode.

In this part, we introduce our new Intermediate Representation (IR) of the disassembled code, which unlocks much more sophisticated analysis and finally puts us on the road to real decompilation.

What is an IR?

An IR is a way to represent executable code as a data structure in memory, in a manner that makes its semantics explicit, and thus easier to analyze and transform. Typically, one or more forms of IR are used by optimizing compilers to represent and optimize the source code they are given. We are coming from the opposite direction - we need to represent code that has already been compiled - but there is little difference. The same principles apply.

Wikipedia has a high level article about Intermediate Representation, which, while not very informative on its own, can be used as a starting point by a curious reader.

An IR typically consists of set of primitive operations like addition, subtraction, comparison, and so on, which are often (but not always) encoded as sequential instructions, somewhat similar to assembly code.

To start getting some intuitive understand, here is an example of a hypothetical compiler, which is given the following C input:

unsigned a, b, c;
a = b + c - 10;

The compiler can represent it as the following IR:

%100 = Load.u32 %b
%101 = Load.u32 %c
%102 = Add.u32 %100, %101
%103 = Sub.u32 %102, 10
       Store %a, %103

The first line loads the value of the variable b, and assigns it to virtual register %100. Similarly, the next line loads the value of variable c, and assigns it to virtual register %101. The next line adds the values of virtual registers %100 and %101, assigning the result to register %102. The next line subtracts the literal constant 10 from register %102 and assigns it to %103. Finally, the last line stores the value of register %103 into the variable a.

We can see how the C expression has been broken up into primitive operations. Since each primitive operation is so simple, it makes it easier for the compiler to reason about it.

But what are these virtual registers %101, %102, etc? They are just a notational convenience. In this IR, every operation that produces a value is thought to have its own virtual register, storing its value. So when we refer to %123, we mean "the value that was calculated by instruction 123".

Because the value of an instruction is determined completely by the instruction itself and the values of its operands, this turns out to be a very convenient way to express how the data flows through the code. This style of IR is said to be in Static Single Assignment Form (typically abbreviated simple as SSA) and is used by the widely successful LLVM project.

Wikipedia's page on SSA can be used as a starting point for learning more.

Why Does Apple2TC Even Need an IR?

Why does a decompiler need an IR, which is normally used by literally the opposite tool - compilers? Surely the decompiler doesn't need to optimize the binary in order to decompile it?

While it may be a counter-intuitive at first, it turns out that a decompiler needs a lot of the same technologies as compilers, namely SSA-based IR, and later even an AST.

We need to be able to analyze the executable code and extract information from it: data flow information, control flow graph, etc (which happen to be the same things that a compiler needs). In order to do that, the code needs to be represented in a convenient form.

Why not just use the original 6502 instructions, perhaps fully decoded/expanded in a struct for easier handling?

The problem with that approach is that 6502 instructions have complex semantics that make them very inconvenient as a basic unit to reason about. For example, an instruction can read a base address from the zero page, add a register to it, then read another address from memory, then finally update one or more CPU flags. It is virtually impossible to map this behavior into C semantics in a readable way.

Instead we need to be able to decompose this behavior into simpler units, which we can then reason about. Hopefully the examples later in this post will make it clear.

The Apple2TC Intermediate Representation

Our IR is conceptually similar to the LLVM IR, but simpler.

(Using the actual LLVM IR was a possibility, which we considered, but eventually decided against, because of the non-trivial complexity it would bring, not to mention a dependency which is orders of magnitude larger than our project...)

There are several classes of IR instructions:

Register Instructions

(LoadR8, StoreR8, LoadSP, StoreSP)

There are exactly two general purpose register instructions - LoadR8 to load the value of an 8-bit CPU register, and StoreR8 to update the value of an 8-bit CPU registers. These instructions are the only way to obtain or modify the value of a register, which makes it much easier to track when registers change. The StoreSP and LoadSP act similarly for the SP register (it needs to be treated specially).

The IR supports the following CPU registers as first class values: A, X, Y.

Additionally, every 6502 CPU flag is represented as a separate 8-bit register: STATUS_N, STATUS_V, STATUS_B, STATUS_D, STATUS_I, STATUS_NotZ, STATUS_C. Keeping the flags as separate registers makes it much easier to track their values.

Every "flag" register, except STATUS_N and STATUS_NotZ, is guaranteed to only have a value of 1 or 0. STATUS_N is guaranteed to always have a value of 0x80 or 0. Lastly, STATUS_NotZ can have any 8-bit value. These assumptions make it easier to emulate 6502 behavior using natural C-style arithmetic.

Arithmetic Instructions

(Shr8, Shr16, Add8, Add16, Sub8, Sub16...)

The IR has all the usual arithmetic operations: addition, subtraction, shifts, etc. The operand type and result of each IR instruction is always explicit. Most operations have 8-bit and 16-bit forms. There are also operations to truncate 16-bit into 8-bit values, and to expand 8-bit into 16-bit values with and without sign.

Comparison Instructions

(Cmp8eq, Cmp8ne, Cmp8ge, Cmp8ae...)

The comparison instructions compare two 8 bit values, with or without sign, producing an 8-bit boolean value (1 or 0).

Memory Instructions

(Peek8, Poke8, RamPeek8, RamPoke8, Peek16al, Peek16un, RamPeek16al, RamPeek16un...)

The memory instructions read and write 8-bit and 16-bit values from memory. There are split in several categories:

Aligned 16-bit accesses (-al suffix). Used when the memory address is known to be aligned to 2.
Unaligned 16-bit accesses (-un suffix). Used when the address cannot be proven to be aligned to 2.
Side effect-free RAM accesses (Ram- prefix). Used when the address is known to not overlap Apple II I/O space. Such accesses have no side effects.
Volatile accesses (no prefix). Used when it cannot be proven than an address does not overlap the Apple II I/O space. Such accesses have potential side effects.

Branch Instructions

(Jmp, JTrue, JFalse...)

Jmp, JTrue and JFalse have the expected semantics: unconditional branch, branch if operand is true, branch if operand is false.

6502-specific instructions

The IR contains several instructions that deal with peculiarities of 6502 (like calculating the V flag, decimal arithmetic) or peculiarities of our disassembly approach (like the lack of a complete control flow graph and call graph). We won't cover these for now, since they are not that interesting.

Putting It All Together

Now that we have some idea of the kinds of supported IR instructions, let's look at an example of decompiling one 6502 instruction into IR:

lda     ($50),y

This is a pretty complicated instruction: first it reads the 16-bit word at address $50, performs a 16-bit addition of the value with register Y, then reads a byte from the resulting address and stores it in register A. Finally, it updates the Z and N flags.

When we decompile it with Apple2tc, we get the following IR:

          %14 = RamPeek16al   0x0050
          %15 = LoadR8        Y
          %16 = ZExt8t16      %15
          %17 = Add16         %14, %16
          %18 = Peek8         %17
                StoreR8       STATUS_NotZ, %18
          %20 = And8          %18, 0x80
                StoreR8       STATUS_N, %20
                StoreR8       A, %18

Yes, that looks like a lot of IR for a single 6502 instruction, but that's precisely the point. By breaking the complicated instruction up into simpler operations, we don't need to reason about "lda ($addr),y" as a whole, instead we get to deal only with simple well-defined operations like additions and memory reads.

Let's go over the decompiled IR line by line:

%14 reads a 16-bit word from address 0x0050, which we know is correctly aligned.
%15 loads the 8-bit value of register Y.
%16 extends the value of register Y to 16-bit.
%17 adds the loaded 16-bit word to the "extended" value of Y, getting us the effective address of the original instruction.
%18 loads the 8-bit value from the effective address.
The next instruction updates the STATUS_NotZ flag by simply copying the value to it.
%20 extracts the sign bit from the value.
The next instruction saves the sign bit into STATUS_N.
Finally, the last instruction stores the value in register A.

So there we have it: register A contains the correct value, and STATUS_NotZ and STATUS_N have been updated correctly. All this done with simple instructions that can be analyzed individually and perhaps even eliminated (as we will see in the next part of this blog).

This sequence of instructions can look a bit intimidating and is not always easy for a human to read. That's why Apple2tc has a mode where it prints the IR in a more user friendly way by detecting "expression trees". Here is the same code printed in "tree mode":

          %18 = Peek8         (Add16 (RamPeek16al 0x0050) (ZExt8t16 (LoadR8 Y)))
                StoreR8       STATUS_NotZ, %18
                StoreR8       STATUS_N, (And8 %18 0x80)
                StoreR8       A, %18

This is the exact same IR, but where possible instructions have been combined into Lisp-style expressions. With some practice, it is much easier to read.

A Lisp expression consists of an open parenthesis, followed by a function name, zero or more arguments separated by space, and finally closing parenthesis (I am simplifying, but if you want to learn more, check out Scheme).

So, looking at the third line, the second operand of StoreR8 is (And8 %18 0x80). The function name is And8 and the two arguments are %18 and 0x80. In other words, it means "perform an 8-bit binary and between the value of instruction 18 and the constant 0x80". Now if we read the entire line, it becomes clear that we are extracting the sign bit of the result of instruction 18 and storing it in STATUS_N.

Instruction 18 is now hopefully easier to understand too: "read a 16-bit word from 0x0050, add it to the 16-bit value of Y, then read a byte from the resulting address".

To Be Continued

In Part 7 we will look in more detail at how we can transform and simplify the IR to discover the original intent of the code.

Tzvetan Mikov's Blog

Search This Blog