Skip to main content

Apple2TC: an Apple II Binary to C Decompiler - Part 6, The IR

 This is part 6 in a series of posts describing Apple2TC. 

The start is in Part 1.

Part 5 described getting two Apple II games to work when decompiled in "Simple C" mode.

In this part, we introduce our new Intermediate Representation (IR) of the disassembled code, which unlocks much more sophisticated analysis and finally puts us on the road to real decompilation.

What is an IR?

An IR is a way to represent executable code as a data structure in memory, in a manner that makes its semantics explicit, and thus easier to analyze and transform. Typically, one or more forms of IR are used by optimizing compilers to represent and optimize the source code they are given. We are coming from the opposite direction - we need to represent code that has already been compiled - but there is little difference. The same principles apply.

Wikipedia has a high level article about Intermediate Representation, which, while not very informative on its own, can be used as a starting point by a curious reader.

An IR typically consists of  set of primitive operations like addition, subtraction, comparison, and so on, which are often (but not always) encoded as sequential instructions, somewhat similar to assembly code.

To start getting some intuitive understand, here is an example of a hypothetical compiler, which is given the following C input:
unsigned a, b, c;
a = b + c - 10;
The compiler can represent it as the following IR:
%100 = Load.u32 %b
%101 = Load.u32 %c
%102 = Add.u32 %100, %101
%103 = Sub.u32 %102, 10
Store %a, %103
The first line loads the value of the variable b, and assigns it to virtual register %100. Similarly, the next line loads the value of variable c, and assigns it to virtual register %101. The next line adds the values of virtual registers %100 and %101, assigning the result to register %102. The next line subtracts the literal constant 10 from register %102 and assigns it to %103. Finally, the last line stores the value of register %103 into the variable a.

We can see how the C expression has been broken up into primitive operations. Since each primitive operation is so simple, it makes it easier for the compiler to reason about it.

But what are these virtual registers %101, %102, etc? They are just a notational convenience. In this IR, every operation that produces a value is thought to have its own virtual register, storing its value. So when we refer to %123, we mean "the value that was calculated by instruction 123". 

Because the value of an instruction is determined completely by the instruction itself and the values of its operands, this turns out to be a very convenient way to express how the data flows through the code. This style of IR is said to be in Static Single Assignment Form (typically abbreviated simple as SSA) and is used by the widely successful LLVM project.

Wikipedia's page on SSA can be used as a starting point for learning more.

Why Does Apple2TC Even Need an IR?

Why does a decompiler need an IR, which is normally used by literally the opposite tool - compilers? Surely the decompiler doesn't need to optimize the binary in order to decompile it?

While it may be a counter-intuitive at first, it turns out that a decompiler needs a lot of the same technologies as compilers, namely SSA-based IR, and later even an AST.

We need to be able to analyze the executable code and extract information from it: data flow information, control flow graph, etc (which happen to be the same things that a compiler needs). In order to do that, the code needs to be represented in a convenient form.

Why not just use the original 6502 instructions, perhaps fully decoded/expanded in a struct for easier handling? 

The problem with that approach is that 6502 instructions have complex semantics that make them very inconvenient as a basic unit to reason about. For example, an instruction can read a base address from the zero page, add a register to it, then read another address from memory, then finally update one or more CPU flags. It is virtually impossible to map this behavior into C semantics in a readable way.

Instead we need to be able to decompose this behavior into simpler units, which we can then reason about. Hopefully the examples later in this post will make it clear. 

The Apple2TC Intermediate Representation

Our IR is conceptually similar to the LLVM IR, but simpler. 

(Using the actual LLVM IR was a possibility, which we considered, but eventually decided against, because of the non-trivial complexity it would bring, not to mention a dependency which is orders of magnitude larger than our project...)

There are several classes of IR instructions:

Register Instructions

(LoadR8, StoreR8, LoadSP, StoreSP)

There are exactly two general purpose register instructions - LoadR8 to load the value of an 8-bit CPU register, and StoreR8 to update the value of an 8-bit CPU registers. These instructions are the only way to obtain or modify the value of a register, which makes it much easier to track when registers change. The StoreSP and LoadSP act similarly for the SP register (it needs to be treated specially).

The IR supports the following CPU registers as first class values: A, X, Y

Additionally, every 6502 CPU flag is represented as a separate 8-bit register: STATUS_N, STATUS_V, STATUS_B, STATUS_D, STATUS_I, STATUS_NotZ, STATUS_C. Keeping the flags as separate registers makes it much easier to track their values. 

Every "flag" register, except STATUS_N and STATUS_NotZ, is guaranteed to only have a value of 1 or 0. STATUS_N is guaranteed to always have a value of 0x80 or 0. Lastly, STATUS_NotZ can have any 8-bit value. These assumptions make it easier to emulate 6502 behavior using natural C-style arithmetic.

Arithmetic Instructions

(Shr8, Shr16, Add8, Add16, Sub8, Sub16...)

The IR has all the usual arithmetic operations: addition, subtraction, shifts, etc. The operand type and result of each IR instruction is always explicit. Most operations have 8-bit and 16-bit forms. There are also operations to truncate 16-bit into 8-bit values, and to expand 8-bit into 16-bit values with and without sign.

Comparison Instructions

(Cmp8eq, Cmp8ne, Cmp8ge, Cmp8ae...)

The comparison instructions compare two 8 bit values, with or without sign, producing an 8-bit boolean value (1 or 0). 

Memory Instructions

(Peek8, Poke8, RamPeek8, RamPoke8, Peek16al, Peek16un, RamPeek16al, RamPeek16un...)

The memory instructions read and write 8-bit and 16-bit values from memory. There are split in several categories:
  • Aligned 16-bit accesses (-al suffix). Used when the memory address is known to be aligned to 2.
  • Unaligned 16-bit accesses (-un suffix). Used when the address cannot be proven to be aligned to 2.
  • Side effect-free RAM accesses (Ram- prefix). Used when the address is known to not overlap Apple II I/O space. Such accesses have no side effects.
  • Volatile accesses (no prefix). Used when it cannot be proven than an address does not overlap the Apple II I/O space. Such accesses have potential side effects.

Branch Instructions

(Jmp, JTrue, JFalse...)

Jmp, JTrue and JFalse have the expected semantics: unconditional branch, branch if operand is true, branch if operand is false.

6502-specific instructions

The IR contains several instructions that deal with peculiarities of 6502 (like calculating the V flag, decimal arithmetic) or peculiarities of our disassembly approach (like the lack of a complete control flow graph and call graph). We won't cover these for now, since they are not that interesting. 

Putting It All Together

Now that we have some idea of the kinds of supported IR instructions, let's look at an example of decompiling one 6502 instruction into IR:
lda     ($50),y
This is a pretty complicated instruction: first it reads the 16-bit word at address $50, performs a 16-bit addition of the value with register Y, then reads a byte from the resulting address and stores it in register A. Finally, it updates the Z and N flags.

When we decompile it with Apple2tc, we get the following IR:
          %14 = RamPeek16al   0x0050
%15 = LoadR8 Y
%16 = ZExt8t16 %15
%17 = Add16 %14, %16
%18 = Peek8 %17
StoreR8 STATUS_NotZ, %18
%20 = And8 %18, 0x80
StoreR8 STATUS_N, %20
StoreR8 A, %18
Yes, that looks like a lot of IR for a single 6502 instruction, but that's precisely the point. By breaking the complicated instruction up into simpler operations, we don't need to reason about "lda ($addr),y" as a whole, instead we get to deal only with simple well-defined operations like additions and memory reads.

Let's go over the decompiled IR line by line:
  • %14 reads a 16-bit word from address 0x0050, which we know is correctly aligned.
  • %15 loads the 8-bit value of register Y.
  • %16 extends the value of register Y to 16-bit.
  • %17 adds the loaded 16-bit word to the "extended" value of Y, getting us the effective address of the original instruction.
  • %18 loads the 8-bit value from the effective address.
  • The next instruction updates the STATUS_NotZ flag by simply copying the value to it.
  • %20 extracts the sign bit from the value. 
  • The next instruction saves the sign bit into STATUS_N.
  • Finally, the last instruction stores the value in register A.
So there we have it: register A contains the correct value, and STATUS_NotZ and STATUS_N have been updated correctly. All this done with simple instructions that can be analyzed individually and perhaps even eliminated (as we will see in the next part of this blog).

This sequence of instructions can look a bit intimidating and is not always easy for a human to read. That's why Apple2tc has a mode where it prints the IR in a more user friendly way by detecting "expression trees". Here is the same code printed in "tree mode":
          %18 = Peek8         (Add16 (RamPeek16al 0x0050) (ZExt8t16 (LoadR8 Y)))
StoreR8 STATUS_NotZ, %18
StoreR8 STATUS_N, (And8 %18 0x80)
StoreR8 A, %18
This is the exact same IR, but where possible instructions have been combined into Lisp-style expressions. With some practice, it is much easier to read. 

A Lisp expression consists of an open parenthesis, followed by a function name, zero or more arguments separated by space, and finally closing parenthesis (I am simplifying, but if you want to learn more, check out Scheme).

So, looking at the third line, the second operand of StoreR8 is (And8 %18 0x80). The function name is And8 and the two arguments are %18 and 0x80. In other words, it means "perform an 8-bit binary and between the value of instruction 18 and the constant 0x80". Now if we read the entire line, it becomes clear that we are extracting the sign bit of the result of instruction 18 and storing it in STATUS_N.

Instruction 18 is now hopefully easier to understand too: "read a 16-bit word from 0x0050, add it to the 16-bit value of Y, then read a byte from the resulting address".

To Be Continued

In Part 7 we will look in more detail at how we can transform and simplify the IR to discover the original intent of the code.

Comments

Popular posts from this blog

You Don't Like Google's Go Because You Are Small

When you look at Google's presentations about Go, they are not shy about it. Go is about very smart people at Google solving very BIG problems. They know best. If you don't like Go, then you are small and are solving small problems. If you were big (or smart), you would surely like it. For example, you might naively think that printing the greater of two numbers should be as simple as std::cout << max(b(),c()) That is because you think small. What you really should want is: t1 := b() if t2 := c(); t1 < t2 { t1 = t2 } fmt.Print( t1 ) Isn't it much better? We didn't have to type all those extra semicolons that were killing productivity before. If you don't like it, you are small. If you wanted to extract an attribute of an optional parameter, you may be used to typing something like: a = p ? p->a : 0; or even: a = p && p->a You just make me sad because obviously what you really want is: a = 0 if p != nil { a = p-...

How to speed up a micro-benchmark 300x

How to speed up a ubenchmark 300x Static Hermes: How to Speed Up a Micro-benchmark by 300x Without Cheating This is the first of a series of light blog posts about Static Hermes, covering topics that I find interesting or amusing. It is not intended to be a systematic introduction to Static Hermes design.Consider it more like a backstage pass to the quirks, features, and “aha!” moments that make Static Hermes unique. If you’re not familiar with Static Hermes, it’s an evolving project we’re working on to explore the confluence of static typing and JavaScript. It is work in progress, but we’re excited about the possibilities it opens up for performance improvements and more. For more background: Tweet with the slide deck of the Static Hermes announcement Previous talk about Hermes Contents: Meet interp-dispatch.js Let’s Run It! Static Hermes with Untyped Code Helping the Compiler Other Ways to Help the Compiler Some Observations Revisiting The Origina...

Apple2TC: an Apple II Binary to C Decompiler - Part 1

This is a series of blog posts to serve as a log documenting my work on Apple2TC - an open source hobby project developed on GitHub:  https://github.com/tmikov/apple2tc . There are various interesting things that come up all the time and I thought it might be useful to record them for posterity. Part 2  describes the different components of the project Part 3 shows how we validate our disassembly by generating a working C representation of it. Part 4 shows the Apple II tricks we had to deal with to get to correct running code. Part 5 describes decompiling and running Robotron 2084 and Snake Byte. Part 6 introduces our SSA-based intermediate representation . Part 7 shows how to transform the IR to discover the original intent of the code.   What is Apple2TC? Apple2TC is an open source project to decompile original Apple II binaries, mostly games, into understandable and working modern C code,  completely automatically  by analyzing the runtime behavior...