Skip to main content

Apple2TC: an Apple II Binary to C Decompiler - Part 1

This is a series of blog posts to serve as a log documenting my work on Apple2TC - an open source hobby project developed on GitHub: https://github.com/tmikov/apple2tc. There are various interesting things that come up all the time and I thought it might be useful to record them for posterity.

  • Part 2 describes the different components of the project
  • Part 3 shows how we validate our disassembly by generating a working C representation of it.
  • Part 4 shows the Apple II tricks we had to deal with to get to correct running code.
  • Part 5 describes decompiling and running Robotron 2084 and Snake Byte.
  • Part 6 introduces our SSA-based intermediate representation.
  • Part 7 shows how to transform the IR to discover the original intent of the code.  

What is Apple2TC?

Apple2TC is an open source project to decompile original Apple II binaries, mostly games, into understandable and working modern C code, completely automatically by analyzing the runtime behavior of the code via custom software emulation. 

When we say automatically, we really mean it - Apple2TC is not intended to be an interactive disassembler like IDA Pro or Ghidra. Our goal is to take an existing binary through the Apple2TC pipeline, without manually debugging it or inspecting its assembly, and in the end obtain portable and working C code, runnable on a modern system.



The Magic Of Custom Software Emulation

While it sounds like a tall order, we believe that we can achieve this by running the original binary in a custom software Apple II emulator, recording all necessary aspects of its runtime behavior and then using the data as input to our decompiler.

Ordinarily, when using traditional disassemblers, we would first try to identify where all the code routines are, the jump tables, create theories on how the code works, then gradually refine these theories by iterative debugging and disassembly sessions. It is a long and painful process, although it can also be quite intellectually satisfying. The author recently made quite a bit of progress in decompiling an IBM PC version of the game Bolo to C here: https://github.com/tmikov/bolo. The result is playable in the browser, compiled to Wasm. 

While satisfying, manual decompilation takes a lot of work of time, and when completed it only applies to a single game. It doesn't scale at all. So, after spending all of that time on decompiling one game, we started thinking - isn't there a way to automate this process?

We believe there is, and it is to record the runtime behavior of the binary and then use that information plus the binary itself as input to decompilation.

Yes, we realize that this isn't completely hands off - in order to have our emulator record the runtime behavior of a game, we need to actually play the game. But come on, that is much more fun than disassembling a game manually, plus if we didn't like to play Apple II games, we wouldn't be doing this at all.

We are happy to report that we have been able to validate our approach and have successfully generated a runnable version of Applesoft BASIC, buildable on a modern OS and running without emulation. The generated C code is not in the shape we want it to be yet, but more on that later.

Decompilation to C

Decompiling 6502 assembler to C is not an easy problem. Apple II games were not developed in C, so we can't literally decompile them. They were never compiled to begin with. They were written directly in 6502 assembler using all kinds of tricks (we will describe some of them as we encounter them). Often the behavior doesn't even cleanly map to C at all.

Still, we believe that we can recreate the equivalent behavior in C as long as we are able to analyze it. The main thing working in our favor is that the Apple II was a tiny computer by today's standards. It only had 64KB of memory, which had to fit everything - the ROM (Applesoft BASIC, Monitor), DOS, graphics pages, stack and the code for the actual game. It was also very slow, its CPU was clocked only at 1 MHz and each CPU instruction took multiple cycles.

Yes, it was a miracle that any realtime graphics games were able to run on the Apple II, but that is really why we want to honor these games and the programmers who created them by decompiling them into C and giving them a new life. People will be able to play these games easily without emulators and understanding the code without knowing 6502 assembly.

(Sadly, knowing 6502 assembler language is not as useful as it used to be, and learning it, while possibly fun, is not necessarily a great time investment. The author was fortunate to learn it on real hardware decades ago.)

Back to decompilation: running on such a tiny and slow computer means that the programs we are analyzing are also really tiny. We can afford to use all kinds of expensive algorithms and data structures that wouldn't be practical for decompiling modern software. We can also afford to extend our Apple II emulator to record whatever we want while remaining much much faster than the original computer. 

If we can't write software to analyze and understand a 20KB binary from 30 years ago running on a computer hundreds of times slower than our phones, we don't deserve to be called programmers... 

To Be Continued

This is the end of Part 1, an introduction to the project. In Part 2, we will describe the various parts of the pipeline that have already been developed. 

Comments

  1. Will this work with .dsk or .woz formats? I notice the .b33 processing to .json, but it wasn't clear to me what processing took (Robotron, for example) you from .dsk to .b33

    ReplyDelete

Post a Comment

Popular posts from this blog

You Don't Like Google's Go Because You Are Small

When you look at Google's presentations about Go, they are not shy about it. Go is about very smart people at Google solving very BIG problems. They know best. If you don't like Go, then you are small and are solving small problems. If you were big (or smart), you would surely like it. For example, you might naively think that printing the greater of two numbers should be as simple as std::cout << max(b(),c()) That is because you think small. What you really should want is: t1 := b() if t2 := c(); t1 < t2 { t1 = t2 } fmt.Print( t1 ) Isn't it much better? We didn't have to type all those extra semicolons that were killing productivity before. If you don't like it, you are small. If you wanted to extract an attribute of an optional parameter, you may be used to typing something like: a = p ? p->a : 0; or even: a = p && p->a You just make me sad because obviously what you really want is: a = 0 if p != nil { a = p-...

How to speed up a micro-benchmark 300x

How to speed up a ubenchmark 300x Static Hermes: How to Speed Up a Micro-benchmark by 300x Without Cheating This is the first of a series of light blog posts about Static Hermes, covering topics that I find interesting or amusing. It is not intended to be a systematic introduction to Static Hermes design.Consider it more like a backstage pass to the quirks, features, and “aha!” moments that make Static Hermes unique. If you’re not familiar with Static Hermes, it’s an evolving project we’re working on to explore the confluence of static typing and JavaScript. It is work in progress, but we’re excited about the possibilities it opens up for performance improvements and more. For more background: Tweet with the slide deck of the Static Hermes announcement Previous talk about Hermes Contents: Meet interp-dispatch.js Let’s Run It! Static Hermes with Untyped Code Helping the Compiler Other Ways to Help the Compiler Some Observations Revisiting The Origina...