Skip to main content

Apple2TC: an Apple II Binary to C Decompiler - Part 4

This is part 4 in a series of posts describing Apple2TC. 

The start is in Part 1

In this part we finally get to the interesting part and describe the Apple II hacks we found while trying to run the decompiled version of Apple II BASIC.

Running the Decompiled Code

In the previous part we described our technique for generating a "Simple C" representation of the disassembled 6502 code. Now it is time to actually build and run the generated code!

Using our a2io and decapplib libraries, it is trivial to link the generated Simple C code into an executable. 

We started by decompiling and running the Apple II ROM code, containing Applesoft BASIC and Monitor, because its disassembled source is already available and heavily commented (thanks to Andy McFadden). We can "cheat" and look at the listing to check whether we are doing something wrong.

Problem 1: Indirect Branches

In our first attempt, we decompiled the ROM statically and used the Simple C backend to convert it to C. The generated code runs for a little bit, but crashes pretty early with
    Unknown code address: $FDF0
Fortunately we have augmented the decompiled code to print the address and register state at the beginning of every basic block, so we can get an idea of what is happening. The last couple of lines printed are:
FF3A:           A=00 X=38 Y=1B SP=FD SR=......Z. PC=FF3A
FDED:           A=87 X=38 Y=1B SP=FD SR=N....... PC=FDED
Searching for this address in https://6502disassembly.com/a2-rom/APPLE2.ROM.html immediately reveals the problem:
                   ; ==============================================================================
                   ; Character Output Routine: Jump (Indirectly via Vector) to User Output Routine
                   ; ==============================================================================
                   ; 
FDED: 6C 36 00     COUT            JMP   (CSWL)          ;Print Accumulator (A) to Output Device
We have an indirect branch, which our disassembler couldn't trace, so we attempted to jump to a branch target that we did not know about.

This was expected - it is notoriously difficult to deal with indirect branches and jump tables when performing automatic disassembly. The disassembler has to rely on various heuristics and complicated analysis to detect jump tables with varying degrees with success. That is why the core idea of our project is to rely on data collected at runtime instead. Runtime data gives us complete accuracy since it tells us what actually happened when the code ran.

Our emulator, a2emu, was designed to collect runtime data via an optional "debug" callback, invoked before every instruction. The callback can decode the instruction, examine its operands, and decide what and when to log. We expected that we would gradually add more and more data collection to the callback, as we identified the kind of runtime data we needed.

So, to start with, we add a check whether the current instruction is any kind of branch. If it is, we evaluate the value of its operand (the branch target), and save it in a std::unordered_set<>. When execution completes, we save the set to a JSON file.    

Later, the disassembler loads the file, adds the recorded branch targets to its existing set, and traces the new ones recursively, as them may lead to identifying new targets and code blocks. It tells us that 22 new labels were identified at runtime, so our efforts weren't for naught!

To our complete shock, running the newly generated code proceeds much farther, as can be seen in this screenshot. It prints the Apple II name and we can actually type on the input line!

The problem is, unfortunately, as soon as we press Enter, we crash and burn with another error... 


Problem 2: Dynamically Generated Code

This is what our log shows:
D444:           A=00 X=FF Y=01 SP=F8 SR=........ PC=D444
00B1:           A=00 X=FF Y=01 SP=F6 SR=.......C PC=00B1
Warning: INVALID at $00B1
In other words, we jumped to address $B1 in the zero page, where naturally we had not disassembled any code! We were genuinely surprised. What is Applesoft BASIC doing branching to the zero page? Isn't the zero page supposed to a very precious scarce resource for important variables? Are we putting actual code in there? Or is this a bug in our decompiler?

Looking for D444 in the disassembly shows us the code lading to it:
D444: 86 B8                        STX   TXTPTR          ;Set up CHRGET to Scan the Input Line
D446: 84 B9                        STY   TXTPTR+1        ;[(X,Y)={Low,High}]= CHRGET Next Chr/Tok Ptr 
D448: 46 D8                        LSR   ERRFLG          ;Defeat ONERR: Clear the Error Flag
D44A: 20 B1 00                     JSR   CHRGET          ;Get a Character
CHRGET itself is defined like this:
                   CHRGET          EQU   $B1             ;~$C8: Get Next Char/Token ZP-Routine (24B)
Aha, so it is not a bug! The ROM really is not only branching, but executing a subroutine at $B1, which it must have put there somehow!

This was surprising - while we expected to encounter some sorts of "shenanigans", we weren't fully mentally prepared to deal with code generated on the fly at runtime! Is all hope lost? Is this the end of our project?

Fortunately, we decided to not give up yet. Every problem has a solution. After all, we can record literally everything that happens in the emulated Apple II. Surely there is a way we can deal with this too?

In our emulator we can track when every byte of RAM is written to, and we can track when every instruction is executed. Perhaps we can track them simultaneously and detect regions of code that were first generated and then executed? Then we can record the contents of these regions and the disassembler can load them?

It is not perfect, but it is a starting point. So, we get to work and enhance the emulator to keep track of memory generations. Each generation contains all bytes that were written to and then executed without modifications in between. Here is a hypothetical sequence of events:

t1. Modify  $1000..$1001
t2. Modify  $0300..$0305
t3. Modify  $1002..$1003
t4. Modify  $0305..$0320
t5. Execute $0300..$0310
t6. Modify  $1004
t7. Modify  $0300 

At point t7, we are modifying a byte that was previously modified and executed. This causes us to close the previous generation of memory. It contains all bytes that were both modified and executed, which in this case is the range $0300..$0310. We record that range, and start a new generation. 

This approach allows us to track ranges of code that are generated at once. It would even work for dynamically loading parts of the code from disk, because first, lots of bytes are modified, then executed without modification.

We execute the enhanced emulator and lo and behold, it has automatically identified a 24 byte executable range at address $B1! It was a proud moment, since it truly happened automatically, without encoding any specific knowledge in the emulator!

"code": [
{
"addr": 177,
"bytes": [
230, 184, 208, 2, 230, 185, 173, 4, 2, 201, 58, 176, 10, 201, 32,
240, 239, 56, 233, 48, 56, 233, 208, 96
]
}
],

We update the decompiler to simply load the generated ranges, run it ... and again crash right away after pressing Enter, but this time with a different error:

DB00: A=0D X=10 Y=08 SP=F4 SR=.......C PC=DB00
D41C: A=F2 X=10 Y=08 SP=F6 SR=N......C PC=D41C
Unknown code address: $DB5A

Problem 3: Overlapping Instructions

Well, this is very strange. Examining the assembly listing produced by our own disassembler, we see that $DB5A is not a valid address!
/*DB57*/             LDA    #M_20            ; $0020
/*DB59*/ BIT M_3FA9 ; $3FA9
/*DB5C*/ L_DB5C: ; xref $D423 $D810 $DAFD $DB49
/*DB5C*/ ORA #$80 ; $0080
/*DB5E*/ CMP #M_A0 ; $00A0
/*DB60*/ BCC L_DB64 ; $DB64
$DB5A is in the middle of the BIT instruction! What the heck is going on? Is the entire listing incorrect? Doesn't look like it, the surrounding instructions seem legit.

Fortunately, the commented Apple II ROM listing is here to save us again!
DB57: A9 20        OUTSP           LDA   #' '            ;Print a Space via COUT
DB59: 2C                           DFB   BIT_ABS         ;Fake BIT OpCode to skip next line

                   ; ==============================================================================
                   ; Print a Question Mark Character:
                   ; ==============================================================================
DB5A: A9 3F        OUTQUES         LDA   #'?'            ;Print a Question Mark via COUT
                   ; 
                   ; ==============================================================================
                   ; Print Character in A-Reg
                   ; ==============================================================================
                   ; Note: POKE 243,32 [$20 in $F3 (FLASHBIT)] will convert output to lower case.
                   ;       This can be cancelled by: POKE 243,0 ($00), NORMAL ($00), INVERSE ($00);
                   ;       or FLASH ($40), POKE 243,64.
                   ; ==============================================================================
DB5C: 09 80        OUTDO           ORA   #%10000000      ;Set High Bit; Make Character High ASCII

Let's see what is going on there... They literally encoded the LDA inside the operand of the BIT instruction, so if you execute BIT, the LDA is skipped. However if you branch directly to the LDA, it executes normally. So, this is just a clever way to skip over the LDA when coming from one place, while executing it if coming from another. 

Ordinarily dong this would require an unconditional branch, which is one or possibly two extra bytes. Discoveries like this make all of this worth it!

Fortunately, this is not that difficult to deal with in our disassembler, we don't even need runtime data, now that we know that it is happening. We simply need another disassembler pass checking for misaligned labels (labels inside an instruction). We record them, and later, when emitting Simple C code, emit both cases. Specifically for the code above:
case 0xdb57: // [$DB57..$DB59]    3 bytes
/* $DB57 LDA */ s_a = update_nz(0x20);
/* $DB59 BIT */ tmp = peek(0x3fa9), s_status = (s_status & ~(0xC0 | STATUS_Z)) | (tmp & 0xC0) | (s_a & tmp ? 0 : STATUS_Z);
s_pc = 0xdb5c;
break;
// WARNING: simple misaligned label
case 0xdb5a: // [$DB5A..$DB5B] 2 bytes
/* $DB5A LDA */ s_a = update_nz(0x3f);
s_pc = 0xdb5c;
break;
case 0xdb5c: // [$DB5C..$DB61] 6 bytes
/* $DB5C ORA */ s_a = update_nz(s_a | 0x80);
So, with this modification, we recompile the binary and run it! 

This time it doesn't crash at Enter! In fact it doesn't crash at all (at least not right away), though it isn't behaving correctly either.

Still, it is progress!



Problem 4: Self-modifying Code

This problem was quite tricky to identify, though in retrospect it seems obvious. We hadn't really thought through why Applesoft BASIC chose to put the routine CHRGET in the zero page. We didn't even look closely at the code, we were so happy when we saw improvement that we didn't think twice about it.

The answer is, of course, they had a very good reason. They placed the routine in writable memory, so it can be modified! Otherwise they would have left it in ROM. Even looking at our own assembler listing makes it obvious:
/*00B1*/ L_00B1:
/*00B1*/ INC M_B8 ; $00B8
/*00B3*/ BNE L_00B7 ; $00B7
/*00B5*/ INC M_B9 ; $00B9
/*00B7*/ L_00B7:
/*00B7*/ LDA M_0204 ; $0204
/*00BA*/ CMP #$3A ; $003A
The instruction at address $00B1 is modifying the instruction at address $00B7! The subroutine is modifying itself! We have to admit, we were ecstatic to discover this. Those folks really knew what they were doing!

Examining the documented listing makes it even more clear - the variable pointing to the next BASIC character is at address $00B8. It is literally inside the LDA instruction.
                   TXTPTR          EQU   $B8             ;CHRGET's Next Char/Token Pointer      (2B)
One can't help but admire it. It wastes no cycles, no extra registers. That is how they were able to make software run on such a slow computer.

Fortunately for us, this is also fairly easy to handle. It turns out that we can statically identify when instructions are modifying the operands of other instructions. 

The distinction between modifying the first byte versus the operands is important! If it is the former, we are in trouble, since we can't (easily and reliably) predict what the modified instruction is.  But if it is the latter, it is possible to deal with. Since we have identified the instruction whose operand is being modified, we can generate slightly different code for it.

In this case, the instruction in question is:
/*00B7*/             LDA    M_0204           ; $0204
Before we knew it was modified, we generated this:
/* $00B7 LDA */ s_a = update_nz(peek(0x0204));
Now that we have identified it, we generate this:
// WARNING: operand self modification.
/* $00B7 LDA */ s_a = update_nz(peek(ram_peek16(0x00b8)));
Instead of embedding the disassembled operand as a literal in the emitted code, we fetch it from RAM where it would have been modified. Simple!

Armed with this self-modification detection, we build it an run it again.

Amazing! This is actual BASIC running without emulation, performing calculations.

Really excited, we try typing a FOR loop, and it crashes right away. In fact, entering almost any other command, besides HOME and PRINT, crashes.
 



Problem 5: Jump to a Subroutine

When we enter the command HGR, we get the following log:
D832:           A=11 X=04 Y=06 SP=F6 SR=N....... PC=D832
00B1:           A=E1 X=04 Y=22 SP=F4 SR=N....... PC=00B1
00B7:           A=E1 X=04 Y=22 SP=F4 SR=........ PC=00B7
00BE:           A=00 X=04 Y=22 SP=F4 SR=N....... PC=00BE
00C2:           A=00 X=04 Y=22 SP=F4 SR=N....... PC=00C2
Unknown code address: $F3E2

Looking at the documented disassembly, we see that $F3E2 is indeed the address of the HGR subroutine:
F3E2: A9 20        HGR             LDA   #>HGR1SCRN      ;Get Hi-Res Screen Pg.1 Base-Address, High
F3E4: 2C 54 C0                     BIT   TXTPAGE1        ;Display Text Page1; R/W Main V-RAM
But why don't we know its address? When we "trained" our emulator, we made sure to execute almost every BASIC command, to allow the emulator to "collect" all addresses for the decompiler.

The error log provides an explanation. The error occurs when executing the RTS instruction of CHRGET (the "notorious" self-modifying routine in the zero page). So, we don't know the address to return to. But why? The previous address in the log is $D832. What do we have there?
D832: 0A                           ASL   A               ;YES, Double A-Reg to get Index
D833: A8                           TAY                   ;Set Indirect Addressing Index
D834: B9 01 D0                     LDA   TKADTBL+1,Y     ;Get Routine Address, High
D837: 48                           PHA                   ;Push Routine Address, High     [High 1st]
D838: B9 00 D0                     LDA   TKADTBL,Y       ;Get Routine Address, Low
D83B: 48                           PHA                   ;Push Routine Address, Low       [Low 2nd]
D83C: 4C B1 00                     JMP   CHRGET          ;Go get Next Char/Token; RTS to Routine
Wow, they tricked us again! Instead of using a JSR to call the subroutine, they are explicitly pushing the address the subroutine should return to, and then calling it with a JMP! That's why we don't know the address!

Fortunately, this is surmountable. We modify our emulator to record RTS addresses as branch target, and voila! We run the newly decompiled ROM and finally we have a fully functioning Apple II BASIC environment without emulation!
 

It took quite a few steps, but we finally got there. Now that we have a validated disassembly, the next step is to convert it to readable C.

To Be Continued

In Part 5 we will describe successfully decompiling and running two Apple II games.

Comments

Popular posts from this blog

You Don't Like Google's Go Because You Are Small

When you look at Google's presentations about Go, they are not shy about it. Go is about very smart people at Google solving very BIG problems. They know best. If you don't like Go, then you are small and are solving small problems. If you were big (or smart), you would surely like it. For example, you might naively think that printing the greater of two numbers should be as simple as std::cout << max(b(),c()) That is because you think small. What you really should want is: t1 := b() if t2 := c(); t1 < t2 { t1 = t2 } fmt.Print( t1 ) Isn't it much better? We didn't have to type all those extra semicolons that were killing productivity before. If you don't like it, you are small. If you wanted to extract an attribute of an optional parameter, you may be used to typing something like: a = p ? p->a : 0; or even: a = p && p->a You just make me sad because obviously what you really want is: a = 0 if p != nil { a = p-...

How to speed up a micro-benchmark 300x

How to speed up a ubenchmark 300x Static Hermes: How to Speed Up a Micro-benchmark by 300x Without Cheating This is the first of a series of light blog posts about Static Hermes, covering topics that I find interesting or amusing. It is not intended to be a systematic introduction to Static Hermes design.Consider it more like a backstage pass to the quirks, features, and “aha!” moments that make Static Hermes unique. If you’re not familiar with Static Hermes, it’s an evolving project we’re working on to explore the confluence of static typing and JavaScript. It is work in progress, but we’re excited about the possibilities it opens up for performance improvements and more. For more background: Tweet with the slide deck of the Static Hermes announcement Previous talk about Hermes Contents: Meet interp-dispatch.js Let’s Run It! Static Hermes with Untyped Code Helping the Compiler Other Ways to Help the Compiler Some Observations Revisiting The Origina...

Apple2TC: an Apple II Binary to C Decompiler - Part 1

This is a series of blog posts to serve as a log documenting my work on Apple2TC - an open source hobby project developed on GitHub:  https://github.com/tmikov/apple2tc . There are various interesting things that come up all the time and I thought it might be useful to record them for posterity. Part 2  describes the different components of the project Part 3 shows how we validate our disassembly by generating a working C representation of it. Part 4 shows the Apple II tricks we had to deal with to get to correct running code. Part 5 describes decompiling and running Robotron 2084 and Snake Byte. Part 6 introduces our SSA-based intermediate representation . Part 7 shows how to transform the IR to discover the original intent of the code.   What is Apple2TC? Apple2TC is an open source project to decompile original Apple II binaries, mostly games, into understandable and working modern C code,  completely automatically  by analyzing the runtime behavior...