Path: utzoo!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!usc!cs.utexas.edu!uunet!comp.vuw.ac.nz!windy!srwmpnm
From: srwmpnm@windy.dsir.govt.nz
Newsgroups: comp.sys.amiga.emulations
Subject: Re: 8 methods to emulate a Z80
Summary: 4 more methods to emulate a Z80
Keywords: Z80 68000 Amiga Emulator
Message-ID: <18850.27dbf20e@windy.dsir.govt.nz>
Date: 11 Mar 91 21:09:34 GMT
References: <18847.27d80900@windy.dsir.govt.nz>
Organization: DSIR, Wellington, New Zealand
Lines: 212

Here are 4 more methods for doing z80 instruction decoding on a 68000. These
methods are all based on fixed-size instruction emulation routines.

There are some general comments, on handling multiple-byte opcodes, writing to
hardware registers, and flag handling, at the end of this post.

These 4 methods are faster than methods 1..3 (see previous post), they do not
have the RAM-write overheads of methods 4..8, and they do not require tables of
opcode routine address offsets. Some of these methods might be superior to all
previous methods.

In all these methods, each z80 instruction emulation routine is fixed at 256
bytes in length. (See earlier post by Ian Farquhar.) They are coded in the
sequence op_80, op_81, ... op_ff, op_00, op_01, ... op_7f. So there is exactly
64k of code. If a routine is shorter than 256 bytes, the extra space is
wasted. If a routine is longer than 256 bytes then it will need a jsr to
somewhere.

-------------------------------------------------------------------------------

Method 9: The "self-modifying fixed-size routine" method:

Warning: Self modifying code follows.

This method is almost the same as in Ian Farquhar's earlier c.s.a.e post.

setup:  lea.l   op_00(pc),a1            ;   a1 always points to op_00
        move.l  z80ram,a2               ;   load pseudopc

mloop:  move.b  (a2)+,1$-op_00+4(a1)    ;16 patch $ff in jmp instruction, inc pc
1$:     jmp     $ff00(a1)               ;10 Jump to routine.
                                        ;26 total cycles

Every instruction emulation routine ends with a copy of mloop, rather than a
jump back to mloop. The move.b patches the high byte of the offset in the
second instruction, so the jump goes to the right routine. (1$-op_00+4 should
probably be 1$-op_00+2, because the 16-bit displacement comes straight after
the one-word jmp opcode --- I haven't checked.)

The code is extremely fast (maybe the fastest yet), but it does not work on
Amigas with instruction caches (68020 and later processors).

I thought of making it even faster by permanently setting a4 to the address of
the byte to patch, and then using:

mloop:  move.b  (a2)+,(a4)              ;12 Patch $ff in jmp instruction, inc pc
        jmp     $ff00(a1)               ;10 Jump to routine.
                                        ;22 total cycles

but you run out of reserved registers if there are lots of copies of mloop.
(I.e., you can't use the decode-at-end-of-instruction technique.) Using
"jmp (a3)" at the end of every routine makes it slower.

-------------------------------------------------------------------------------

Method 10: The "standard fixed-size routine" method:

Now we eliminate self-modifying code. The main loop is coded into the wasted
space at the end of op_ff (so that it is within 128 bytes of op_00 --- Remember
that op_00 is in the middle of the 64k code block).

setup:  subq.l  #2,sp                   ;   make room for scratch area on stack
        clr.w   (sp)                    ;   low byte of scratch area is always 0
        lea.l   mloop(pc),a3            ;   a3 always points to mloop
        move.l  z80ram,a2               ;   load pseudopc

mloop:  move.b  (a2)+,(sp)              ;12 Opcode to scratch high byte, inc pc
        move.w  (sp),d0                 ; 8 high byte is opcode, low byte is 0
        jmp     op_00(pc,d0.w)          ;14 Jump to routine.

Each z80 instruction emulation routine ends with:

        jmp     (a3)                    ; 8
                                        ;42 total cycles

Unfortunately it's quite a bit slower. We can do better...
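
The address arithmetic behind the "jmp op_00(pc,d0.w)" in method 10 can be
checked with a small model. This is only a Python sketch of the offset
calculation, not 68000 code. The names sign_extend_16, routine_offset and
OP_00_OFFSET are made up for illustration, and it assumes the block layout
described at the top of this post (op_80 first, op_00 in the middle). It shows
why that ordering is needed: d0.w is treated as a signed index, so opcodes
$80..$ff land below op_00.

```python
ROUTINE_SIZE = 256        # every routine is padded to 256 bytes
OP_00_OFFSET = 0x8000     # assumed: op_00 sits in the middle of the 64k block


def sign_extend_16(value):
    """Interpret a 16-bit value as signed, as the 68000 does with d0.w."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def routine_offset(opcode):
    """Offset of the routine for `opcode` from the start of the 64k block."""
    d0 = (opcode & 0xFF) << 8          # opcode in the high byte, low byte 0
    return OP_00_OFFSET + sign_extend_16(d0)


if __name__ == "__main__":
    # Print where a few routines land inside the block.
    for opcode in (0x80, 0xFF, 0x00, 0x7F):
        print(f"op_{opcode:02x} at block offset ${routine_offset(opcode):04x}")
```

In this model op_80 comes out at offset 0 and op_7f at the last 256-byte slot,
which matches the coding sequence op_80 ... op_ff, op_00 ... op_7f.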

-------------------------------------------------------------------------------

Method 11: The "decode-at-end-of-instruction fixed-size routine" method:

Register a3 always points to op_00 instead of to mloop, and we have:

setup:  subq.l  #2,sp                   ;   make room for scratch area on stack
        clr.w   (sp)                    ;   low byte of scratch area is always 0
        lea.l   op_00(pc),a3            ;   a3 always points to op_00
        move.l  z80ram,a2               ;   load pseudopc

mloop:  move.b  (a2)+,(sp)              ;12 Opcode to scratch high byte, inc pc
        move.w  (sp),d0                 ; 8 high byte is opcode, low byte is 0
        jmp     0(a3,d0.w)              ;14 Jump to routine.
                                        ;34 total cycles

Each z80 instruction emulation routine ends with a copy of the decode routine.
This is faster than method 10, and mloop can be coded anywhere.

Can we avoid using scratch memory and still be as fast? Think about how you
might do this before you read on.

An 8-bit shift of register d0 avoids using scratch memory, but is slower (on a
plain 68000). The next method shows how to make the decode faster and avoid
using scratch memory, but it (possibly) introduces overhead elsewhere.

-------------------------------------------------------------------------------

Method 12: The "stretched-address-space fixed-size-routine" method:

This method assumes that the z80 address space (z80ram) is stretched to 128k,
so that each byte in the z80's address space takes up a word in the Amiga. The
low order byte of every word must always be 0.

setup:  move.l  z80ram,a2               ;   load pseudopc
        lea.l   op_00(pc),a3            ;   a3 always points to op_00

mloop:  move.w  (a2)+,d0                ; 8 Opcode to d0 high byte, inc pc
        jmp     0(a3,d0.w)              ;14 Jump to routine.
                                        ;22 total cycles

For best results, every routine ends with the mloop code (decode at end of
instruction).

The instruction decode is faster than method 11, but now many instructions
will have extra work to do to convert byte z80 addresses to word amiga
addresses. Still, this code looks good enough to try.
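
The stretched layout of method 12 can be pictured with another small model.
Again this is a Python sketch rather than 68000 code, and the helper names
(make_stretched_ram, poke, peek, fetch_dispatch_word) are invented for the
example. It assumes one 16-bit word per z80 byte with the data in the high
byte, as described above. The point is that a fetch needs no shifting,
because the stored word is already the jump offset, while every ordinary read
and write has to go through the high byte.

```python
Z80_SPACE = 0x10000   # the z80 sees 64k addresses


def make_stretched_ram():
    """128k image modelled as one 16-bit word per z80 byte, all zero."""
    return [0] * Z80_SPACE


def poke(ram, z80_addr, value):
    """z80 memory write: the byte goes in the high half, the low half stays 0."""
    ram[z80_addr & 0xFFFF] = (value & 0xFF) << 8


def peek(ram, z80_addr):
    """z80 memory read: recover the byte from the high half of the word."""
    return ram[z80_addr & 0xFFFF] >> 8


def fetch_dispatch_word(ram, z80_pc):
    """What 'move.w (a2)+,d0' would load: opcode * 256, ready for the jmp."""
    return ram[z80_pc & 0xFFFF]


if __name__ == "__main__":
    ram = make_stretched_ram()
    poke(ram, 0x1234, 0xC3)
    print(hex(peek(ram, 0x1234)), hex(fetch_dispatch_word(ram, 0x1234)))
```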

Miscellaneous hint #9: To convert a byte offset to a word offset, use
"add.w d0,d0", not "lsl.w #1,d0".

Another miscellaneous hint: Maybe there's a use for movep here.

You could maintain 2 copies of the z80 address space --- one in 64k and the
other in 128k. Then it's just a simple matter of writing a byte to both places
whenever the z80 does a write. That gets rid of the overhead of converting
between offset types on memory reads.

But now our method is starting to look like threaded code (method 8) again.
The threaded code method uses the 128k block to store the offset to the
handling routine, rather than storing the opcode itself. The overhead in doing
a memory write is the same in both methods, and threaded code has other
advantages (like not having to pad code to 256 bytes, multiple-byte opcodes
handled better). So we're back to threaded code again.

-------------------------------------------------------------------------------

A note on multiple-byte opcode instructions:

The z80 uses these. They are prefixed with $cb, $dd, $ed and $fd. There are
also $ddcb and $fdcb prefixed instructions.

For fixed-size methods, (methods 9..12), the fastest way to cope with long
instructions is to use more tables. But that means multiplying the number of
tables by 5 or 7. 64k of code has just jumped to 320k or 448k. That's no good
on a small machine. Also, if you reserve a register to point to op_00 in each
table, that's 5 or 7 registers gone. Oops.

-------------------------------------------------------------------------------

Some notes on threaded code:

I spent last evening trying threaded code (method 8) in the Spectrum emulator.
(See previous post.) I got a 20..40% speed improvement over the
position-independent standard method (256-way CASE statement). It's still
several times slower than a real Spectrum, unfortunately. There is some slow
code in places where there wasn't before, so there is scope for more
improvement.

Unfortunately I introduced some bugs during the systematic changes, and they
are proving hard to track down. Everything in the Spectrum ROM seems to work
ok. Some machine-code programs that worked before have stopped working.

Threaded code is extremely fast for multiple-opcode instructions. Control is
vectored directly to the right routine first time, without having to decode
multiple tables.

A problem with this is that if a Spectrum program overwrites the second opcode
of a multiple-byte instruction ($cb, $dd, $ed, $fd) without writing to the
first byte, then the emulator doesn't cope.

-------------------------------------------------------------------------------

Note on hardware registers:

Handling writes to hardware registers in the Spectrum isn't really very hard.
The z80 has a separate IO address space with a separate set of instructions
for handling it. (The same is true of the 8088.)

The only thing to watch for is writing to video RAM. It is fixed size and at a
fixed place, so it takes 2 tests. (There doesn't seem to be a faster way, such
as a single bit test --- the video RAM doesn't end on a power-of-2 boundary.)

I have a separate task (sort of) which uses the blitter to keep the screen
up-to-date with the video RAM. So when there is a write to video RAM, the
emulator task doesn't have to do much. It just flags the blitter task "Hey,
there's something to update in character row n, when you wake up". The blitter
task doesn't slow the emulator down much, because it's mostly running on
another processor, and it sleeps when there's nothing to update.

-------------------------------------------------------------------------------

Some notes on flag handling:

Both of the CP/M emulators I know about spend a lot of time handling z80 flags
(condition codes). After just about every instruction they do a "move sr,d0"
or "move ccr,d0" or call GetCC() to get the 68000 flags, then they do a table
lookup to translate them to z80 format.
After every logical instruction (not, or, xor etc), a second table lookup is
done to set the z80 parity flag. (The 68000 does not have a parity flag.)
These table lookups are slow. In fact, they often take several times as long
as the guts of the instruction itself.

Both these table lookups are totally unnecessary! It's faster to save the
flags in 68000 format (in a register). Routines that test flags simply test
the corresponding 68000 flag. For logical instructions, simply save the parity
byte away somewhere, and set another bit in the register to say to use the
parity byte and not the v flag. The parity testing instructions (e.g.,
"jp po,nn") look at that bit, and then test either the v flag or the saved
parity byte.

The only times you need to translate flags between z80 and 68000 formats are
in "push af" and "pop af" instructions.

I got a 10..20% speedup in my Spectrum emulator this way.

-------------------------------------------------------------------------------

I said in my previous post that threaded code for 8088 emulation would use too
much memory to be practical. In fact it would be perfectly practical on an
Amiga equipped with 3 Mbytes or more.

Peter McGavin. (srwmpnm@wnv.dsir.govt.nz)