This is a series of blog posts to serve as a log documenting my work on Apple2TC - an open source hobby project developed on GitHub: https://github.com/tmikov/apple2tc. There are various interesting things that come up all the time and I thought it might be useful to record them for posterity.
- Part 2 describes the different components of the project
- Part 3 shows how we validate our disassembly by generating a working C representation of it.
- Part 4 shows the Apple II tricks we had to deal with to get to correct running code.
- Part 5 describes decompiling and running Robotron 2084 and Snake Byte.
- Part 6 introduces our SSA-based intermediate representation.
- Part 7 shows how to transform the IR to discover the original intent of the code.
What is Apple2TC?
Apple2TC is an open source project to decompile original Apple II binaries, mostly games, into understandable and working modern C code, completely automatically by analyzing the runtime behavior of the code via custom software emulation.
When we say automatically, we really mean it - Apple2TC is not intended to be an interactive disassembler like IDA Pro or Ghidra. Our goal is to take an existing binary through the Apple2TC pipeline, without manually debugging it or inspecting its assembly, and in the end obtain portable and working C code, runnable on a modern system.
The Magic Of Custom Software Emulation
While it sounds like a tall order, we believe that we can achieve this by running the original binary in a custom software Apple II emulator, recording all necessary aspects of its runtime behavior and then using the data as input to our decompiler.
Ordinarily, when using traditional disassemblers, we would first try to identify where all the code routines are, the jump tables, create theories on how the code works, then gradually refine these theories by iterative debugging and disassembly sessions. It is a long and painful process, although it can also be quite intellectually satisfying. The author recently made quite a bit of progress in decompiling an IBM PC version of the game Bolo to C here: https://github.com/tmikov/bolo. The result is playable in the browser, compiled to Wasm.
While satisfying, manual decompilation takes a lot of work of time, and when completed it only applies to a single game. It doesn't scale at all. So, after spending all of that time on decompiling one game, we started thinking - isn't there a way to automate this process?
We believe there is, and it is to record the runtime behavior of the binary and then use that information plus the binary itself as input to decompilation.
Yes, we realize that this isn't completely hands off - in order to have our emulator record the runtime behavior of a game, we need to actually play the game. But come on, that is much more fun than disassembling a game manually, plus if we didn't like to play Apple II games, we wouldn't be doing this at all.
Yes, we realize that this isn't completely hands off - in order to have our emulator record the runtime behavior of a game, we need to actually play the game. But come on, that is much more fun than disassembling a game manually, plus if we didn't like to play Apple II games, we wouldn't be doing this at all.
We are happy to report that we have been able to validate our approach and have successfully generated a runnable version of Applesoft BASIC, buildable on a modern OS and running without emulation. The generated C code is not in the shape we want it to be yet, but more on that later.
Decompilation to C
Decompiling 6502 assembler to C is not an easy problem. Apple II games were not developed in C, so we can't literally decompile them. They were never compiled to begin with. They were written directly in 6502 assembler using all kinds of tricks (we will describe some of them as we encounter them). Often the behavior doesn't even cleanly map to C at all.
Still, we believe that we can recreate the equivalent behavior in C as long as we are able to analyze it. The main thing working in our favor is that the Apple II was a tiny computer by today's standards. It only had 64KB of memory, which had to fit everything - the ROM (Applesoft BASIC, Monitor), DOS, graphics pages, stack and the code for the actual game. It was also very slow, its CPU was clocked only at 1 MHz and each CPU instruction took multiple cycles.
Yes, it was a miracle that any realtime graphics games were able to run on the Apple II, but that is really why we want to honor these games and the programmers who created them by decompiling them into C and giving them a new life. People will be able to play these games easily without emulators and understanding the code without knowing 6502 assembly.
(Sadly, knowing 6502 assembler language is not as useful as it used to be, and learning it, while possibly fun, is not necessarily a great time investment. The author was fortunate to learn it on real hardware decades ago.)
(Sadly, knowing 6502 assembler language is not as useful as it used to be, and learning it, while possibly fun, is not necessarily a great time investment. The author was fortunate to learn it on real hardware decades ago.)
Back to decompilation: running on such a tiny and slow computer means that the programs we are analyzing are also really tiny. We can afford to use all kinds of expensive algorithms and data structures that wouldn't be practical for decompiling modern software. We can also afford to extend our Apple II emulator to record whatever we want while remaining much much faster than the original computer.
If we can't write software to analyze and understand a 20KB binary from 30 years ago running on a computer hundreds of times slower than our phones, we don't deserve to be called programmers...
To Be Continued
This is the end of Part 1, an introduction to the project. In Part 2, we will describe the various parts of the pipeline that have already been developed.
Will this work with .dsk or .woz formats? I notice the .b33 processing to .json, but it wasn't clear to me what processing took (Robotron, for example) you from .dsk to .b33
ReplyDelete