Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!waikato.ac.nz!comp.vuw.ac.nz!windy!srwmpnm
From: srwmpnm@windy.dsir.govt.nz
Newsgroups: comp.sys.amiga.emulations
Subject: 8 methods to emulate a Z80
Summary: 8 methods to emulate a Z80 on a 68000
Keywords: Z80 68000 Amiga
Message-ID: <18847.27d80900@windy.dsir.govt.nz>
Date: 8 Mar 91 21:58:24 GMT
Organization: DSIR, Wellington, New Zealand
Lines: 259

Ok folks, here are 8 methods for doing z80 emulation on a 68000, in software.
(Well, 8 methods to decode a z80 instruction and get to the right emulation
routine, anyway.)

Trade-offs are speed, space and cleanliness. They all fall short of
"compiling and optimising", but most of these methods will speed up most
existing emulators. As you might expect, the largest and dirtiest code is
usually the fastest (and least portable).

The same methods should work with emulation of 6502, PDP-11 and any other
processors with a 16-bit address space.

In all methods, I assume there is a 64kb block of memory representing the
z80's address space, allocated by AllocMem (say), and pointed to by "z80ram".

-------------------------------------------------------------------------------

Method 1: The "standard" method:

I call this method "standard" because it's used in both of the CP/M z80
emulators I know about.

The general idea is to decode the current instruction and jump to the
appropriate emulation routine via a vector table. That is, like a CASE
statement with 256 selections. The code is clean and re-entrant.

; Setup
        move.l  z80ram,a2       ; load pseudopc
        lea.l   optabl(pc),a1   ; a1 always points to optabl
        lea.l   mloop(pc),a3    ; a3 always points to mloop

; Main loop (decode) starts here
mloop:
        moveq   #0,d0           ; 4 Execute appropriate subroutine.
        move.b  (a2)+,d0        ; 8 Grab the next opcode and inc pc.
        asl     #2,d0           ;10 D0 high word is still zero!
        move.l  0(a1,d0.w),a0   ;18 Get address of routine from table
        jmp     (a0)            ; 8 Do the subroutine.
                                ;48 total cycles to decode

        even
optabl: dc.l    nop00,lxib,staxb,inxb,inrb,dcrb,mvib,rlc
        dc.l    ...

Each z80 instruction emulation routine ends with:

        jmp     (a3)

-------------------------------------------------------------------------------

Method 2: The "position-independent" method:

This is slightly quicker, the executable is more than 1500 bytes smaller,
and you get another register to play with in the emulator (a1 in this case).
I currently use this method (or close to it) in my Spectrum emulator.
The code is clean and re-entrant.

        move.l  z80ram,a2       ; load pseudopc
        lea.l   mloop(pc),a3    ; a3 always points to mloop

mloop:
        moveq   #0,d0                   ; 4 clear opcode word
        move.b  (a2)+,d0                ; 8 get opcode byte
        add.w   d0,d0                   ; 4 2 bytes per entry
        move.w  optabl(pc,d0.w),d0      ;14 get offset of routine
        jmp     optabl(pc,d0.w)         ;14 do instruction
                                        ;44 total to decode

        even
optabl: dc.w    nop00-optabl,lxib-optabl,staxb-optabl,inxb-optabl
        dc.w    inrb-optabl,dcrb-optabl,mvib-optabl,rlc-optabl
        dc.w    ...

Each instruction emulation routine ends with:

        jmp     (a3)

-------------------------------------------------------------------------------

Method 3: The "decode-at-end-of-instruction" method:

(There are really 2 methods described here.)

Take either method 1 or method 2. Instead of ending each emulation routine
with "jmp (a3)", end each one with a complete copy of the code from mloop to
the indirect jmp. There is no longer a main loop, because each instruction
jumps directly to the next one.

This method is slightly faster, takes maybe twice as much code, is clean, and
is re-entrant. It also saves yet another reserved register, in this case a3.
(Personally, I find that a z80 emulator needs as many free registers as you
can get your fingers on.)

-------------------------------------------------------------------------------

Method 4: The "threaded jsr's" method:

Warning: This method uses self-modifying, non-re-entrant code, and therefore
is not recommended. This code is hazardous to your cache! (No flames please
--- read on).

Introduce a 390kb contiguous block of code (called thread) which looks like
this:

thread: jsr     patch           ; 0
        jsr     patch           ; 1
        ...
        jsr     patch           ; 65535
        jmp     thread

That is, there is a jsr instruction for each byte in the z80's address space.
This is in addition to z80ram.

To start the emulator, you transfer control to "thread".

What the "patch" routine does is to replace the current "jsr patch" with
"jsr this_routine", where this_routine is the emulation routine for the
corresponding opcode in z80ram. Then patch jmps to this_routine to execute
the instruction and to return to the next jsr in the thread.

After a while, patch will no longer be called (except by z80 self-modifying
code), and every jsr made will be to emulate a z80 opcode directly.

Whenever a z80 instruction writes to RAM, it patches the corresponding
"jsr this_routine" with "jsr patch". As a variation, it could patch
"jsr this_routine" with "jsr new_routine", but that would probably be slower
in general.

Advantage: It would be faster than methods 1 to 3, --- I think, ---
especially in the Spectrum emulator, which has to do a lot of work with every
write to RAM to check for ROM and video RAM anyway. The main reason for the
extra speed is that it no longer has to decode the opcode on every
instruction. There are the extra overheads of call and return though, and
extra work to do on every RAM write.

Disadvantages:

1: The code breaks C='s self-modifying code law. To run on Amigas with
   caches, it would have to either disable the caches or update them manually
   after every patch. The code is extremely dirty, not re-entrant, and
   definitely not recommended;

2: You need 390k contiguous memory (plus another 64k somewhere else, plus
   whatever else you need for video).

Other characteristics: Code would run slowly the first time round the loop,
then speed up.

--------------------------------------------------------------------------

Method 5: The "replicated code" method.

Warning: This also uses self-modifying, non-re-entrant code and is therefore
not recommended.

Thread consists of 65536 blocks of code, each long enough to emulate the
trickiest z80 instruction. Initially it contains 65536 copies of patch.
(You will need A LOT of contiguous memory.)

What patch does is to actually copy the code for the opcode over itself, then
transfer control to the beginning of itself. (Tricky, but it can be done.)

Every emulation routine finishes with a "bra.s next_instr" so they are all
really the same length. That saves the call and return overhead. If an
emulation routine is too long, then just use a jmp to somewhere.

In practice, you would probably start with:

        jsr     patch
        bra.s   next_instr

in every slot, rather than a complete copy of patch. Z80 RAM writes would
copy the above code to the corresponding slot, if necessary, rather than
copying the whole patch routine.

Short of "compiling and optimising", this is the fastest method I can think
of, but it is incredibly space-wasting, self-modifying, extremely dirty, and
definitely not recommended.

--------------------------------------------------------------------------

Method 6: The "threaded vector table" method:

Ok, now to fix the self-modifying code problem. Take method 4 (threaded
jsr's), but use a 262kb vector table in a private data segment, instead of a
thread in the code segment.

vectors:
        dc.l    patch           ; 0
        dc.l    patch           ; 1
        ...
        dc.l    patch           ; 65535
        dc.l    jmp_thread

The main instruction loop looks like:

        lea.l   vectors,a0
        lea.l   mloop(pc),a2

mloop:  move.l  (a0)+,a1        ;12 cycles
        jmp     (a1)            ; 8 cycles

and every instruction finishes with "jmp (a2)". A0 is acting as a
"pseudo-pc" into the vector table.

Of course patch performs the same functions as before (except it is no longer
self-modifying, it just patches a vector). The vector table still needs to be
updated by every write to Z80 RAM.

The code is re-entrant provided each task has a separate copy of the vector
table.
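
The threaded vector table idea can be sketched in C roughly as follows. This
is a minimal illustration, not the 68000 code described above: the names
Machine, Handler, decode, z80_write, run and demo are hypothetical, only four
opcodes are handled, flags are ignored, and it assumes that emulation
routines read their own operand bytes from RAM and advance the pseudo-pc past
them (the post does not say how operands are handled).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define Z80_SPACE 65536u            /* size of the Z80 address space */

typedef struct Machine Machine;
typedef void (*Handler)(Machine *m);

struct Machine {
    uint8_t  ram[Z80_SPACE];        /* plays the role of z80ram */
    Handler  vectors[Z80_SPACE];    /* one vector per Z80 address */
    uint16_t pc;                    /* index form of the a0 pseudo-pc */
    uint8_t  a;                     /* Z80 accumulator */
    int      halted;
    unsigned patch_calls;           /* instrumentation for the demo only */
};

/* Emulation routines for a tiny, illustrative subset of opcodes. */
static void op_nop(Machine *m)    { (void)m; }
static void op_inc_a(Machine *m)  { m->a++; }                 /* flags omitted */
static void op_ld_a_n(Machine *m) { m->a = m->ram[m->pc++]; } /* skips operand */
static void op_halt(Machine *m)   { m->halted = 1; }

static Handler decode(uint8_t opcode) {
    switch (opcode) {
    case 0x00: return op_nop;       /* NOP */
    case 0x3C: return op_inc_a;     /* INC A */
    case 0x3E: return op_ld_a_n;    /* LD A,n */
    default:   return op_halt;      /* 0x76 HALT; unknown opcodes also stop */
    }
}

/* Replace this slot's vector with the real routine, then execute it. */
static void patch(Machine *m) {
    uint16_t at = (uint16_t)(m->pc - 1);  /* dispatch already advanced pc */
    Handler h = decode(m->ram[at]);
    assert(h != NULL);
    m->patch_calls++;
    m->vectors[at] = h;
    h(m);
}

/* Every Z80 RAM write resets the matching vector back to patch. */
static void z80_write(Machine *m, uint16_t addr, uint8_t value) {
    m->ram[addr] = value;
    m->vectors[addr] = patch;
}

static void machine_init(Machine *m, const uint8_t *prog, size_t len) {
    memset(m, 0, sizeof *m);
    memcpy(m->ram, prog, len);
    for (size_t i = 0; i < Z80_SPACE; i++)
        m->vectors[i] = patch;
}

/* Main loop: fetch a vector through the pseudo-pc and jump through it. */
static void run(Machine *m) {
    m->pc = 0;
    m->a = 0;
    m->halted = 0;
    while (!m->halted) {
        Handler h = m->vectors[m->pc++];
        h(m);
    }
}

static Machine demo_machine;        /* static: too large for most stacks */

/* Runs the program three times, overwriting one opcode before the third. */
static void demo(unsigned counts[3], uint8_t acc[3]) {
    static const uint8_t prog[] = { 0x3E, 0x05, 0x3C, 0x76 }; /* LD A,5; INC A; HALT */
    machine_init(&demo_machine, prog, sizeof prog);
    for (int i = 0; i < 3; i++) {
        if (i == 2)
            z80_write(&demo_machine, 2, 0x00);  /* INC A becomes NOP */
        run(&demo_machine);
        counts[i] = demo_machine.patch_calls;
        acc[i] = demo_machine.a;
    }
}
```

In this sketch the first pass goes through patch once per opcode, the second
pass never calls patch, and a single RAM write costs exactly one more patch
call on the following pass. Each task owning its own Machine corresponds to
each task having a separate copy of the vector table.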

--------------------------------------------------------------------------

Method 7: The "position-independent threaded vector table" method:

Same as method 6, except that now the private data segment is:

thread: dc.w    patch-base      ; 0
        dc.w    patch-base      ; 1
        ...
        dc.w    patch-base      ; 65535
        dc.w    jmp_thread-base

and the main loop is:

        lea.l   thread,a0
        lea.l   mloop(pc),a1

mloop:  move.w  (a0)+,d0        ; 8 cycles
        jmp     base(pc,d0.w)   ;14 cycles

base:
patch:  ...
op00:   ...
op01:   ...
jmp_thread: ...

Now it is position-independent, needs only 128kb of contiguous memory, the
executable is 1500 bytes smaller, and it is slightly slower (only by 2 cycles
per z80 instruction though).

The code is re-entrant provided each task has a separate copy of the vector
table.

--------------------------------------------------------------------------

Method 8: The "decode-at-end-of-instruction threaded vector table" method:

Same as method 6 except that every opcode emulation routine finishes with:

        move.l  (a0)+,a1
        jmp     (a1)

instead of "jmp (a2)". Now isn't that faster? And it saves a2 for more
important things.

Unfortunately you can't do exactly the same thing to method 7 unless you can
write a complete z80 emulator in 256 bytes 8-) . But you could take method 7
and end each emulation routine with:

mloop:  move.w  (a0)+,d0
        lea.l   base(pc),a1
        jmp     0(a1,d0.w)

instead.

The code is re-entrant provided each task has a separate copy of the vector
table.

--------------------------------------------------------------------------

Personally I'm considering using one of the methods 6, 7 or 8 in the next
version of the Spectrum emulator (probably method 8). (That is, if I ever get
enough spare time without more interesting things to do.) I'll probably make
the source public domain. That will use more Amiga RAM, but should go faster
(I hope). Any guesses as to which method will be the fastest, and still fit
comfortably in a 512k machine?
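
As a rough aid to the 512k question, the table sizes quoted for methods 4, 6
and 7 follow from one slot per Z80 address multiplied by the slot size. The
sketch below is only that arithmetic in C; the names table_bytes and
footprint_bytes are hypothetical, and the figures cover the table and z80ram
only, not the emulator code, video memory, or the single extra
jmp/jmp_thread entry at the end of each table.

```c
#include <assert.h>

#define Z80_SPACE 65536UL   /* one slot per Z80 address */

/* Slot sizes in bytes, from the 68000 encodings the methods rely on. */
enum {
    SLOT_JSR_ABS_L = 6,     /* Method 4: jsr with a 32-bit absolute address */
    SLOT_DC_L      = 4,     /* Methods 6 and 8: dc.l vector */
    SLOT_DC_W      = 2      /* Method 7: dc.w offset */
};

/* Size of the per-address table alone. */
static unsigned long table_bytes(unsigned long slot_bytes) {
    return Z80_SPACE * slot_bytes;
}

/* Table plus the 64kb z80ram block. */
static unsigned long footprint_bytes(unsigned long slot_bytes) {
    unsigned long total = table_bytes(slot_bytes) + Z80_SPACE;
    assert(total > Z80_SPACE);  /* a slot must occupy at least one byte */
    return total;
}
```

This gives 393216 bytes for the method 4 thread, 262144 bytes for the method
6 and 8 vector table, and 131072 bytes for the method 7 table, matching the
390kb, 262kb and 128kb figures above. Adding z80ram brings the totals to
458752, 327680 and 196608 bytes respectively, against 524288 bytes in a 512k
machine.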

Unfortunately I don't think any of the methods (except the first 3) are
suitable for an 8088 emulator because of the huge memory requirements.

I'm interested in any ideas anyone might have along these lines. The
discussion of "compiling and optimising" is very interesting, but I don't see
how the details would work. In particular, how do you cope with
self-modifying code, code loaders, overlays etc?

Peter McGavin. (srwmpnm@wnv.dsir.govt.nz)

Disclaimer: I haven't tested any of the above ideas (except 1 and 2).
If you see any bugs, point them out.