Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!waikato.ac.nz!comp.vuw.ac.nz!windy!srwmpnm
From: srwmpnm@windy.dsir.govt.nz
Newsgroups: comp.sys.amiga.emulations
Subject: 8 methods to emulate a Z80
Summary: 8 methods to emulate a Z80 on a 68000
Keywords: Z80 68000 Amiga
Message-ID: <18847.27d80900@windy.dsir.govt.nz>
Date: 8 Mar 91 21:58:24 GMT
Organization: DSIR, Wellington, New Zealand
Lines: 259

Ok folks, here are 8 methods for doing z80 emulation on a 68000, in software.
(Well, 8 methods to decode a z80 instruction and get to the right emulation
routine, anyway.)

Trade-offs are speed, space and cleanliness.  They all fall short of
"compiling and optimising", but most of these methods will speed up most
existing emulators.  As you might expect, the largest and dirtiest code is
usually the fastest (and least portable).  The same methods should work for
emulating the 6502, PDP-11 and any other processor with a 16-bit address space.

In all methods, I assume there is a 64kb block of memory representing the z80's
address space, allocated by AllocMem (say), and pointed to by "z80ram".

-------------------------------------------------------------------------------
Method 1: The "standard" method:

I call this method "standard" because it's used in both of the CP/M z80
emulators I know about.  The general idea is to decode the current instruction
and jump to the appropriate emulation routine via a vector table.  That is,
like a CASE statement with 256 selections.  The code is clean and re-entrant.

; Setup
	move.l	z80ram,a2		;   load pseudopc
	lea.l	optabl(pc),a1		;   a1 always points to optabl
	lea.l	mloop(pc),a3		;   a3 always points to mloop

; Main loop (decode) starts here
mloop:	moveq	#0,d0			; 4 Clear d0; opcode is zero-extended.
	move.b	(a2)+,d0		; 8 Grab the next opcode and inc pc.
	asl	#2,d0			;10 4 bytes per entry; fits in a word.
	move.l	0(a1,d0.w),a0		;18 Get address of routine from table.
	jmp	(a0)			; 8 Do the subroutine.
					;48 total cycles to decode
	even
optabl:	dc.l	nop00,lxib,staxb,inxb,inrb,dcrb,mvib,rlc
	dc.l	...

Each z80 instruction emulation routine ends with:

	jmp	(a3)
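
For readers more at home in C, the table-driven decode above can be sketched
roughly as follows.  This is only a model of the idea, not code from any
emulator: the Z80 struct, the handler names, init_optabl() and run() are made
up for illustration, only four real z80 opcodes are handled (NOP 00, INC A 3C,
LD A,n 3E, HALT 76), flags are ignored, and every unimplemented opcode is
assumed to behave like HALT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature machine state; only what this sketch needs. */
typedef struct {
    uint8_t  ram[65536];  /* plays the role of z80ram */
    uint16_t pc;          /* pseudo-pc (a2 in the listing) */
    uint8_t  a;           /* z80 accumulator */
    int      halted;
} Z80;

typedef void (*Handler)(Z80 *);

static void op_nop(Z80 *z)    { (void)z; }
static void op_inc_a(Z80 *z)  { z->a++; }
static void op_ld_a_n(Z80 *z) { z->a = z->ram[z->pc++]; }
static void op_halt(Z80 *z)   { z->halted = 1; }

static Handler optabl[256];  /* like optabl: one routine address per opcode */

static void init_optabl(void) {
    for (int i = 0; i < 256; i++)
        optabl[i] = op_halt;  /* assumption: unimplemented opcodes just stop */
    optabl[0x00] = op_nop;
    optabl[0x3C] = op_inc_a;
    optabl[0x3E] = op_ld_a_n;
    optabl[0x76] = op_halt;
}

/* Main loop: fetch the opcode, look up its routine, call it, repeat. */
static uint8_t run(Z80 *z) {
    assert(z != NULL);
    init_optabl();
    while (!z->halted) {
        uint8_t opcode = z->ram[z->pc++];
        optabl[opcode](z);
    }
    return z->a;
}
```

The call through optabl[opcode] stands in for the "move.l 0(a1,d0.w),a0" and
"jmp (a0)" pair, and returning to the while loop stands in for "jmp (a3)".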

-------------------------------------------------------------------------------
Method 2: The "position-independent" method:

This is slightly quicker, the executable is more than 1500 bytes smaller, and
you get another register to play with in the emulator (a1 in this case).  I
currently use this method (or close to it) in my Spectrum emulator.  The code
is clean and re-entrant.

	move.l	z80ram,a2		;   load pseudopc
	lea.l	mloop(pc),a3		;   a3 always points to mloop
mloop:	moveq	#0,d0			; 4 clear opcode word
	move.b	(a2)+,d0		; 8 get opcode byte
	add.w	d0,d0			; 4 2 bytes per entry
	move.w	optabl(pc,d0.w),d0	;14 get offset of routine
	jmp	optabl(pc,d0.w)		;14 do instruction
					;44 total to decode
	even
optabl:	dc.w	nop00-optabl,lxib-optabl,staxb-optabl,inxb-optabl
	dc.w	inrb-optabl,dcrb-optabl,mvib-optabl,rlc-optabl
	dc.w	...

Each instruction emulation routine ends with:

	jmp	(a3)

-------------------------------------------------------------------------------
Method 3: The "decode-at-end-of-instruction" method:

(There are really 2 methods described here.)  Take either method 1 or method 2.
Instead of ending each emulation routine with "jmp (a3)", end each one with a
complete copy of the code from mloop to the indirect jmp.  There is no longer
a main loop, because each instruction jumps directly to the next one.

This method is slightly faster, takes maybe twice as much code, is clean, and
is re-entrant.  It also saves yet another reserved register, in this case a3.
(Personally, I find that a z80 emulator needs as many free registers as you
can get your fingers on.)
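
A rough C model of the same idea, for what it is worth.  Standard C has no
indirect jump, so a switch full of gotos stands in for the table lookup and the
jmp; the point is only that the whole decode step (the hypothetical DISPATCH
macro) is expanded at the end of every routine, so there is no shared main
loop.  The function name run(), the labels and the tiny opcode subset (NOP 00,
INC A 3C, DEC A 3D, LD A,n 3E, everything else treated as HALT) are
assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One complete copy of the decode step; expanded after every routine. */
#define DISPATCH()                  \
    switch (ram[pc++]) {            \
    case 0x00: goto op_nop;         \
    case 0x3C: goto op_inc_a;       \
    case 0x3D: goto op_dec_a;       \
    case 0x3E: goto op_ld_a_n;      \
    default:   goto op_halt;        \
    }

/* ram must point at a full 64kb block; pc wraps at 16 bits like the z80's. */
static uint8_t run(const uint8_t *ram, uint16_t pc) {
    uint8_t a = 0;                  /* z80 accumulator */
    assert(ram != NULL);
    DISPATCH();                     /* transfer control to the first routine */
op_nop:    DISPATCH();
op_inc_a:  a++;           DISPATCH();
op_dec_a:  a--;           DISPATCH();
op_ld_a_n: a = ram[pc++]; DISPATCH();
op_halt:   return a;
}
```

Each routine jumps straight to the next one, at the price of one copy of the
decode code per routine.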

-------------------------------------------------------------------------------
Method 4: The "threaded jsr's" method:

Warning: This method uses self-modifying, non-re-entrant code, and therefore
is not recommended.  This code is hazardous to your cache!  (No flames please
--- read on).

Introduce a 390kb contiguous block of code (called thread; 65536 jsr's at 6
bytes each, plus the final jmp) which looks like this:

thread:		jsr	patch	; 0
		jsr	patch	; 1
		...
		jsr	patch	; 65535
		jmp	thread

That is, there is a jsr instruction for each byte in the z80's address space.
This is in addition to z80ram.

To start the emulator, you transfer control to "thread".  What the "patch"
routine does is to replace the current "jsr patch" with "jsr this_routine",
where this_routine is the emulation routine for the corresponding opcode in
z80ram.  Then patch jmps to this_routine to execute the instruction and to
return to the next jsr in the thread.  After a while, patch will no longer be
called (except by z80 self modifying code), and every jsr made will be to
emulate a z80 opcode directly.

Whenever a z80 instruction writes to RAM, it patches the corresponding
"jsr this_routine" with "jsr patch".  As a variation, it could patch
"jsr this_routine" with "jsr new_routine", but that would probably be slower
in general.

Advantage:

It would be faster than methods 1 to 3 (I think), especially in the
Spectrum emulator, which has to do a lot of work with every write to RAM to
check for ROM and video RAM anyway.  The main reason for the extra speed is
that it no longer has to decode the opcode on every instruction.  There are
the extra overheads of call and return though, and extra work to do on every
RAM write.

Disadvantages:

1: The code breaks Commodore's law against self-modifying code.  To run on
Amigas with caches, it would have to either disable the caches or update them
manually after every patch.  The code is extremely dirty, not re-entrant, and
definitely not recommended;

2: You need 390k contiguous memory (plus another 64k somewhere else, plus
whatever else you need for video).

Other characteristics:

Code would run slowly the first time round the loop, then speed up.

--------------------------------------------------------------------------
Method 5: The "replicated code" method:

Warning: This also uses self-modifying, non-re-entrant code and is therefore
not recommended.

Thread consists of 65536 blocks of code, each long enough to emulate the
trickiest z80 instruction.  Initially it contains 65536 copies of patch.  (You
will need A LOT of contiguous memory.)  What patch does is to actually copy
the code for the opcode over itself, then transfer control to the beginning of
itself.  (Tricky, but it can be done.)  Every emulation routine finishes with
a "bra.s next_instr" so they are all really the same length.  That saves the
call and return overhead.

If an emulation routine is too long, then just use a jmp to somewhere.  In
practice, you would probably start with:

	jsr	patch
	bra.s	next_instr

in every slot, rather than a complete copy of patch.  Z80 RAM writes would
copy the above code to the corresponding slot, if necessary, rather than
copying the whole patch routine.

Short of "compiling and optimising", this is the fastest method I can think of,
but it is incredibly space-wasting, self-modifying, extremely dirty, and
definitely not recommended.

--------------------------------------------------------------------------
Method 6: The "threaded vector table" method:

Ok, now to fix the self-modifying code problem.  Take method 4 (threaded jsr's),
but use a 262kb vector table in a private data segment, instead of a thread in
the code segment.

vectors:	dc.l	patch	; 0
		dc.l	patch	; 1
		...
		dc.l	patch	; 65535
		dc.l	jmp_thread

The main instruction loop looks like:

		lea.l	vectors,a0
		lea.l	mloop(pc),a2
mloop:		move.l	(a0)+,a1	;12 cycles
		jmp	(a1)		; 8 cycles

and every instruction finishes with "jmp (a2)".  A0 is acting as a "pseudo-pc"
into the vector table.  Of course patch performs the same functions as before
(except it is no longer self modifying, it just patches a vector).  The vector
table still needs to be updated by every write to Z80 RAM.  The code is
re-entrant provided each task has a separate copy of the vector table.
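
The patch-on-first-use scheme might be modelled in C along these lines.  This
is a sketch under several assumptions, not a tested emulator: the Z80 struct,
decode(), z80_reset() and run() are invented names, the z80 pc doubles as the
index into the vector table (the listing keeps a separate pointer in a0), only
NOP 00, LD (nn),A 32, INC A 3C and LD A,n 3E are decoded, and anything else is
treated as HALT.  The patches counter exists only so the effect can be
observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Z80 Z80;
typedef void (*Handler)(Z80 *);

struct Z80 {
    uint8_t  ram[65536];      /* plays the role of z80ram */
    Handler  vectors[65536];  /* one vector per z80 address */
    uint16_t pc;              /* indexes ram and vectors alike */
    uint8_t  a;               /* z80 accumulator */
    int      halted;
    unsigned patches;         /* how many times patch has run */
};

static void patch(Z80 *z);

static void op_nop(Z80 *z)    { (void)z; }
static void op_inc_a(Z80 *z)  { z->a++; }
static void op_ld_a_n(Z80 *z) { z->a = z->ram[z->pc++]; }
static void op_halt(Z80 *z)   { z->halted = 1; }

/* LD (nn),A writes to RAM, so it also resets the vector for that address. */
static void op_ld_nn_a(Z80 *z) {
    uint16_t lo = z->ram[z->pc++];
    uint16_t hi = z->ram[z->pc++];
    uint16_t addr = (uint16_t)(lo | (hi << 8));
    z->ram[addr] = z->a;
    z->vectors[addr] = patch;
}

/* Decode one opcode byte; anything unimplemented is treated as HALT. */
static Handler decode(uint8_t opcode) {
    switch (opcode) {
    case 0x00: return op_nop;
    case 0x32: return op_ld_nn_a;
    case 0x3C: return op_inc_a;
    case 0x3E: return op_ld_a_n;
    default:   return op_halt;
    }
}

/* Reached through any vector that has not been resolved yet. */
static void patch(Z80 *z) {
    uint16_t addr = (uint16_t)(z->pc - 1);  /* run() already stepped pc */
    Handler h = decode(z->ram[addr]);
    z->vectors[addr] = h;                   /* next time, go direct */
    z->patches++;
    h(z);
}

static void z80_reset(Z80 *z) {
    assert(z != NULL);
    for (size_t i = 0; i < 65536; i++)
        z->vectors[i] = patch;
    z->pc = 0; z->a = 0; z->halted = 0; z->patches = 0;
}

/* Main loop: no decoding here, just fetch a vector and call it. */
static void run(Z80 *z, uint16_t start) {
    z->pc = start;
    z->halted = 0;
    while (!z->halted) {
        Handler h = z->vectors[z->pc++];
        h(z);
    }
}
```

Running the same z80 code twice should show patch being entered only for
addresses that were written to in between, which is the "slow the first time
round, then faster" behaviour described under method 4.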

--------------------------------------------------------------------------
Method 7: The "position-independent threaded vector table" method:

Same as method 6, except that now the private data segment is:

thread:		dc.w	patch-base	; 0
		dc.w	patch-base	; 1
		...
		dc.w	patch-base	; 65535
		dc.w	jmp_thread-base

and the main loop is:

		lea.l	thread,a0
		lea.l	mloop(pc),a1	; routines end with "jmp (a1)"
mloop:		move.w	(a0)+,d0	; 8 cycles
		jmp	base(pc,d0.w)	;14 cycles
base:
patch:		...
op00:		...
op01:		...
jmp_thread:	...

Now it is position-independent, needs only 128kb of contiguous memory, the
executable is 1500 bytes smaller, and it is slightly slower (only by 2 cycles
per z80 instruction though).  The code is re-entrant provided each task has a
separate copy of the vector table.

--------------------------------------------------------------------------
Method 8: The "decode-at-end-of-instruction threaded vector table" method:

Same as method 6 except that every opcode emulation routine finishes with:

		move.l	(a0)+,a1
		jmp	(a1)

instead of "jmp (a2)".  Now isn't that faster?  And it saves a2 for more
important things.

Unfortunately you can't do exactly the same thing to method 7 unless you can
write a complete z80 emulator in 256 bytes  8-) .  But you could take method 7
and end each emulation routine with:

		move.w	(a0)+,d0
		lea.l	base(pc),a1
		jmp	0(a1,d0.w)

instead.  The code is re-entrant provided each task has a separate copy of
the vector table.

--------------------------------------------------------------------------
Personally I'm considering using one of the methods 6, 7 or 8 in the next
version of the Spectrum emulator (probably method 8).  (That is, if I ever get
enough spare time without more interesting things to do.)  I'll probably make
the source public domain.  That will use more Amiga RAM, but should go faster
(I hope).  Any guesses as to which method will be the fastest, and still fit
comfortably in a 512k machine?
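
As a rough aid to the memory side of that question, the table sizes can be
worked out like this.  The sketch assumes that a "jsr patch" and the closing
"jmp thread" both use the 6-byte absolute-long form, that each table has one
extra wrap-around entry as in the listings, and that the only other fixed cost
counted is the 64kb z80ram block; emulator code, video memory and the OS are
left out.  The names table_bytes() and total_bytes() are made up here.

```c
#include <assert.h>
#include <stdint.h>

enum { Z80_ADDRESSES = 65536 };  /* size of the z80 address space */

/* Bytes per table entry: jsr abs.l (method 4), dc.l (6 and 8), dc.w (7). */
enum { METHOD4_ENTRY = 6, METHOD6_ENTRY = 4, METHOD7_ENTRY = 2 };

/* One entry per z80 address plus one wrap-around entry at the end. */
static uint32_t table_bytes(uint32_t entry_bytes) {
    assert(entry_bytes > 0);
    return ((uint32_t)Z80_ADDRESSES + 1u) * entry_bytes;
}

/* Table plus the 64kb z80ram block. */
static uint32_t total_bytes(uint32_t entry_bytes) {
    return table_bytes(entry_bytes) + (uint32_t)Z80_ADDRESSES;
}
```

Under those assumptions table_bytes() gives 393222 bytes for method 4, 262148
for methods 6 and 8, and 131074 for method 7.  Adding z80ram gives 458758,
327684 and 196610 bytes respectively, out of the 524288 in a 512k machine.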

Unfortunately I don't think any of the methods (except the first 3) are
suitable for an 8088 emulator because of the huge memory requirements.

I'm interested in any ideas anyone might have along these lines.  The
discussion of "compiling and optimising" is very interesting, but I don't see
how the details would work.  In particular, how do you cope with self-modifying
code, code loaders, overlays etc?


Peter McGavin.  (srwmpnm@wnv.dsir.govt.nz)

Disclaimer:  I haven't tested any of the above ideas (except 1 and 2).  If you
see any bugs, point them out.