Path: utzoo!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!usc!cs.utexas.edu!uunet!comp.vuw.ac.nz!windy!srwmpnm
From: srwmpnm@windy.dsir.govt.nz
Newsgroups: comp.sys.amiga.emulations
Subject: Re: 8 methods to emulate a Z80
Summary: 4 more methods to emulate a Z80
Keywords: Z80 68000 Amiga Emulator
Message-ID: <18850.27dbf20e@windy.dsir.govt.nz>
Date: 11 Mar 91 21:09:34 GMT
References: <18847.27d80900@windy.dsir.govt.nz>
Organization: DSIR, Wellington, New Zealand
Lines: 212

Here are 4 more methods for doing z80 instruction decoding on a 68000.  These
methods are all based on fixed-size instruction emulation routines.

There are some general comments at the end of this post on handling
multiple-byte opcodes, writing to hardware registers, and flag handling.

These 4 methods are faster than methods 1..3 (see previous post).  They do not
have the RAM-write overheads of methods 4..8, and they do not require tables of
opcode routine address offsets.  Some of these methods might be superior to all
previous methods.

In all these methods, each z80 instruction emulation routine is fixed at 256
bytes in length.  (See earlier post by Ian Farquhar.)  They are coded in the
sequence op_80, op_81, ... op_ff, op_00, op_01, ... op_7f.  So there is exactly
64k of code.  If a routine is shorter than 256 bytes, the extra space is
wasted.  If a routine is longer than 256 bytes then it will need a jsr to
somewhere.
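
The ordering op_80 ... op_ff, op_00 ... op_7f follows from the 68000
sign-extending a 16-bit offset or index, so an opcode placed in the high byte
of a word acts as a signed multiple of 256 relative to op_00.  A minimal C
model of that arithmetic (routine_offset is a made-up name, not something from
the post):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: signed byte offset of routine op_xx from op_00,
   assuming the opcode sits in the high byte of a 16-bit word that is
   sign-extended, and every routine is exactly 256 bytes long. */
static int32_t routine_offset(unsigned opcode)
{
    assert(opcode <= 0xffu);
    int32_t off = (int32_t)opcode * 256;          /* opcode in the high byte */
    return off >= 0x8000 ? off - 0x10000 : off;   /* 16-bit sign extension   */
}
```

Opcodes $80..$ff therefore land in the 32k before op_00 and opcodes $00..$7f
in the 32k starting at op_00.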

-------------------------------------------------------------------------------
Method 9: The "self-modifying fixed-size routine" method:

Warning: Self modifying code follows.

This method is almost the same as in Ian Farquhar's earlier c.s.a.e post.

setup:	lea.l	op_00(pc),a1		; a1 always points to op_00
	move.l	z80ram,a2		; load pseudopc

mloop:	move.b	(a2)+,1$-op_00+2(a1)	;16 patch $ff in jmp instruction, inc pc
1$:	jmp	$ff00(a1)		;10 Jump to routine.
					;26 total cycles

Every instruction emulation routine ends with a copy of mloop, rather than a
jump back to mloop.

The move.b patches the high byte of the offset in the second instruction, so
the jump goes to the right routine.  (The 16-bit offset follows the 2-byte
opcode word of the jmp, so its high byte is at 1$+2.)
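
As a rough check of which byte gets patched, here is a C model of the
instruction image.  It assumes the usual encoding of "jmp d16(a1)" as the
opcode word $4ee9 followed by the 16-bit displacement; the names jmp_insn,
patch_jmp and jmp_displacement are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Image of "jmp $ff00(a1)": opcode word $4ee9, then the displacement word. */
static uint8_t jmp_insn[4] = { 0x4e, 0xe9, 0xff, 0x00 };

/* Model of the move.b: overwrite the displacement's high byte (offset 2). */
static void patch_jmp(unsigned z80_opcode)
{
    assert(z80_opcode <= 0xffu);
    jmp_insn[2] = (uint8_t)z80_opcode;
}

/* Displacement the patched jmp would use, sign-extended as on the 68000. */
static int32_t jmp_displacement(void)
{
    int32_t d = (jmp_insn[2] << 8) | jmp_insn[3];
    return d >= 0x8000 ? d - 0x10000 : d;
}
```

The low displacement byte stays 0, so the jump always lands on a 256-byte
routine boundary.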

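The code is extremely fast (maybe the fastest yet), but it does not work
reliably on Amigas whose CPU has an instruction cache (68020 and up), because
the patched jmp can be executed from a stale cached copy.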

I thought of making it even faster by permanently setting a4 to the address of
the byte to patch, and then using:

mloop:	move.b	(a2)+,(a4)		;12 Patch $ff in jmp instruction, inc pc
	jmp	$ff00(a1)		;10 Jump to routine.
					;22 total cycles

but a4 can point at only one jmp, so you run out of reserved registers if
there are lots of copies of mloop.  (I.e., you can't use the
decode-at-end-of-instruction technique.)  Using "jmp (a3)" at the end of every
routine to get back to a single mloop makes it slower.

-------------------------------------------------------------------------------
Method 10: The "standard fixed-size routine" method:

Now we eliminate self-modifying code.

The main loop is coded into the wasted space at the end of op_ff (so that it is
within 128 bytes of op_00 --- Remember that op_00 is in the middle of the 64k
code block).

setup:	subq.l	#2,sp			; make room for scratch area on stack
	clr.w	(sp)			; low byte of scratch area is always 0
	lea.l	mloop(pc),a3		; a3 always points to mloop
	move.l	z80ram,a2		; load pseudopc

mloop:	move.b	(a2)+,(sp)		;12 Opcode to scratch high byte, inc pc
	move.w	(sp),d0			; 8 high byte is opcode, low byte is 0
	jmp	op_00(pc,d0.w)		;14 Jump to routine.

Each z80 instruction emulation routine ends with:

	jmp	(a3)			; 8
					;42 total cycles

Unfortunately it's quite a bit slower.  We can do better...

-------------------------------------------------------------------------------
Method 11: The "decode-at-end-of-instruction fixed-size routine" method:

Register a3 always points to op_00 instead of to mloop, and we have:

setup:	subq.l	#2,sp			; make room for scratch area on stack
	clr.w	(sp)			; low byte of scratch area is always 0
	lea.l	op_00(pc),a3		; a3 always points to op_00
	move.l	z80ram,a2		; load pseudopc

mloop:	move.b	(a2)+,(sp)		;12 Opcode to scratch high byte, inc pc
	move.w	(sp),d0			; 8 high byte is opcode, low byte is 0
	jmp	0(a3,d0.w)		;14 Jump to routine.
					;34 total cycles

Each z80 instruction emulation routine ends with a copy of the decode routine.
This is faster than method 10, and mloop can be coded anywhere.

Can we avoid using scratch memory and still be as fast?  Think about how you
might do this before you read on.  An 8-bit shift of register d0 avoids using
scratch memory, but is slower (on a plain 68000).  The next method shows how to
make the decode faster and avoid using scratch memory, but it (possibly)
introduces overhead elsewhere.
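
To make the comparison concrete: on a plain 68000 a register shift costs 6+2n
cycles, so "lsl.w #8,d0" alone is 22 cycles, more than the 20 cycles of the
two moves through scratch memory.  Both routes produce the same index, as this
small C model shows (the function names are made up for the sketch, and the
scratch word is modelled as two bytes in big-endian order):

```c
#include <assert.h>
#include <stdint.h>

/* Method 11 style: store the opcode in the high byte of a big-endian
   scratch word whose low byte stays 0, then read the whole word back. */
static uint16_t index_via_scratch(unsigned opcode)
{
    uint8_t scratch[2] = { 0, 0 };   /* scratch[0] is the high byte */
    assert(opcode <= 0xffu);
    scratch[0] = (uint8_t)opcode;                        /* move.b (a2)+,(sp) */
    return (uint16_t)((scratch[0] << 8) | scratch[1]);   /* move.w (sp),d0    */
}

/* Register-only alternative: an 8-bit left shift (lsl.w #8,d0). */
static uint16_t index_via_shift(unsigned opcode)
{
    assert(opcode <= 0xffu);
    return (uint16_t)(opcode << 8);
}
```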

-------------------------------------------------------------------------------
Method 12: The "stretched-address-space fixed-size-routine" method:

This method assumes that the z80 address space (z80ram) is stretched to 128k,
so that each byte in the z80's address space takes up a word in the Amiga.
The low order byte of every word must always be 0.

setup:	move.l	z80ram,a2		; load pseudopc
	lea.l	op_00(pc),a3		; a3 always points to op_00

mloop:	move.w	(a2)+,d0		; 8 Opcode to d0 high byte, inc pc
	jmp	0(a3,d0.w)		;14 Jump to routine.
					;22 total cycles

For best results, every routine ends with the mloop code (decode at end of
instruction).  The instruction decode is faster than method 11, but now many
instructions will have extra work to do to convert byte z80 addresses to word
amiga addresses.  Still, this code looks good enough to try.
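
A minimal C sketch of the stretched layout, assuming one 16-bit word per z80
byte with the z80 byte in the high half.  The array and function names are
invented here; the real thing would of course be 68000 code working through
a2.

```c
#include <assert.h>
#include <stdint.h>

/* Stretched z80 memory: one 16-bit word per z80 byte (128k in total).
   The z80 byte lives in the high byte; the low byte is always 0. */
static uint16_t stretched[65536];

static void poke(uint16_t z80_addr, uint8_t value)
{
    stretched[z80_addr] = (uint16_t)(value << 8);
}

static uint8_t peek(uint16_t z80_addr)
{
    assert((stretched[z80_addr] & 0xffu) == 0);   /* invariant: low byte is 0 */
    return (uint8_t)(stretched[z80_addr] >> 8);
}

/* Opcode fetch: the word read is already the routine index (opcode * 256). */
static uint16_t fetch_index(uint16_t z80_pc)
{
    return stretched[z80_pc];
}

/* The extra work: z80 byte address -> byte offset into the stretched block. */
static uint32_t stretched_offset(uint16_t z80_addr)
{
    return (uint32_t)z80_addr + z80_addr;         /* doubling, like an add */
}
```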

Miscellaneous hint #9: To convert a byte offset to a word offset, use
"add.w d0,d0" (4 cycles), not "lsl.w #1,d0" (8 cycles).

Another miscellaneous hint: Maybe there's a use for movep here.

You could maintain 2 copies of the z80 address space --- one in 64k and the
other in 128k.  Then it's just a simple matter of writing a byte to both places
whenever the z80 does a write.  That gets rid of the overhead of converting
between offset types on memory reads.
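
A sketch of the two-copy idea in C, under the assumption that every z80 write
goes through one routine.  The names flat_copy, wide_copy and the z80_*
functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t  flat_copy[65536];   /* 64k copy, used for data reads       */
static uint16_t wide_copy[65536];   /* 128k copy, used for opcode fetches  */

/* Every z80 write updates both copies. */
static void z80_write(uint16_t addr, uint8_t value)
{
    flat_copy[addr] = value;
    wide_copy[addr] = (uint16_t)(value << 8);
}

/* Data reads use the flat copy, so no address conversion is needed. */
static uint8_t z80_read(uint16_t addr)
{
    assert(wide_copy[addr] == (uint16_t)(flat_copy[addr] << 8)); /* in step */
    return flat_copy[addr];
}

/* Opcode fetches use the wide copy and get the routine index directly. */
static uint16_t z80_fetch_index(uint16_t pc)
{
    return wide_copy[pc];
}
```

The cost has moved from every read to every write, which is the trade-off the
next paragraph compares with threaded code.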

But now our method is starting to look like threaded code (method 8) again.
The threaded code method uses the 128k block to store the offset to the
handling routine, rather than storing the opcode itself.  The overhead in doing
a memory write is the same in both methods, and threaded code has other
advantages (like not having to pad code to 256 bytes, multiple-byte opcodes
handled better).  So we're back to threaded code again.

-------------------------------------------------------------------------------
A note on multiple-byte opcode instructions:

The z80 uses these.  They are prefixed with $cb, $dd, $ed and $fd.  There are
also $ddcb and $fdcb prefixed instructions.

For fixed-size methods (methods 9..12), the fastest way to cope with long
instructions is to use more tables, i.e. one more 64k block of routines for
each prefix.  But that means multiplying the number of tables by 5 or 7.  64k
of code has just jumped to 320k or 448k.  That's no good on a small machine.
Also, if you reserve a register to point to op_00 in each table, that's 5 or 7
registers gone.  Oops.
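
To show where the 5 and 7 come from, here is a C sketch that picks a table
from the prefix bytes.  The enum, struct and function names are invented for
illustration.  It assumes the standard z80 encoding, in which the $ddcb and
$fdcb forms put a displacement byte before the final opcode byte, and it
ignores oddities such as repeated $dd/$fd prefixes.

```c
#include <assert.h>
#include <stdint.h>

enum table_id { T_BASE, T_CB, T_DD, T_ED, T_FD, T_DDCB, T_FDCB, T_COUNT };

struct decoded {
    enum table_id table;   /* which 64k block of routines to use        */
    uint8_t       opcode;  /* byte that selects the routine within it   */
    unsigned      length;  /* bytes consumed up to and including opcode */
};

static struct decoded select_table(const uint8_t *p)
{
    struct decoded d;
    assert(p != 0);
    d.table = T_BASE; d.opcode = p[0]; d.length = 1;
    switch (p[0]) {
    case 0xcb: d.table = T_CB; break;
    case 0xdd: d.table = T_DD; break;
    case 0xed: d.table = T_ED; break;
    case 0xfd: d.table = T_FD; break;
    default:   return d;               /* unprefixed: base table */
    }
    d.opcode = p[1];
    d.length = 2;
    if ((d.table == T_DD || d.table == T_FD) && p[1] == 0xcb) {
        /* dd cb d op / fd cb d op: displacement precedes the last opcode */
        d.table  = (d.table == T_DD) ? T_DDCB : T_FDCB;
        d.opcode = p[3];
        d.length = 4;
    }
    return d;
}
```

Five tables cover the base set plus $cb, $dd, $ed and $fd; adding $ddcb and
$fdcb makes seven, and T_COUNT * 64k gives the 448k figure.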

-------------------------------------------------------------------------------
Some notes on threaded code:

I spent last evening trying threaded code (method 8) in the Spectrum emulator. 
(See previous post.)  I got a 20..40% speed improvement over the position-
independent standard method (256-way CASE statement).  It's still several times
slower than a real Spectrum, unfortunately.  There is some slow code in places
where there wasn't before, so there is scope for more improvement.
Unfortunately I introduced some bugs during the systematic changes, and they
are proving hard to track down.  Everything in the Spectrum ROM seems to work
ok.  Some machine-code programs that worked before have stopped working.

Threaded code is extremely fast for multiple-opcode instructions.  Control is
vectored directly to the right routine first time, without having to decode
multiple tables.  A problem with this is that if a Spectrum program overwrites
the second byte of a multiple-byte instruction (one starting with $cb, $dd, $ed
or $fd) without writing to the first byte, then the emulator doesn't cope: the
handler offset stored for the first byte still refers to the old instruction.
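
The failure can be reproduced in a toy C model.  This is not the emulator's
actual code: the "handler id" below is just a stand-in for the stored routine
offset, and all names are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t  mem[65536];      /* z80 memory                           */
static uint16_t thread[65536];   /* stand-in for the 128k handler table  */

static int is_prefix(uint8_t b)
{
    return b == 0xcb || b == 0xdd || b == 0xed || b == 0xfd;
}

/* Hypothetical handler id: the byte itself for plain opcodes, prefix and
   second byte combined for prefixed ones. */
static uint16_t handler_for(uint16_t addr)
{
    uint8_t b = mem[addr];
    if (is_prefix(b))
        return (uint16_t)((b << 8) | mem[(uint16_t)(addr + 1)]);
    return b;
}

static void rethread(uint16_t addr)
{
    thread[addr] = handler_for(addr);
}

/* A write that re-threads only the cell that was written. */
static void threaded_write(uint16_t addr, uint8_t value)
{
    mem[addr] = value;
    rethread(addr);
}

/* True when the stored handler no longer matches what is in memory. */
static int cell_is_stale(uint16_t addr)
{
    return thread[addr] != handler_for(addr);
}
```

After a write to the second byte only, the cell for the prefix byte still
names the old instruction, which is the case described above.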

-------------------------------------------------------------------------------
Note on hardware registers:

Handling writes to hardware registers in the Spectrum isn't really very hard.
The z80 has a separate IO address space with a separate set of instructions for
handling it.  (The same is true of the 8088.)  The only thing to watch for is
writing to video RAM.  It is fixed size and at a fixed place, so it takes 2
tests, one for each end of the range.  (There doesn't seem to be a faster way,
such as a single bit test, because the video RAM doesn't end on a power-of-2
boundary.)  I have a separate task
(sort of) which uses the blitter to keep the screen up-to-date with the video
RAM.  So when there is a write to video RAM, the emulator task doesn't have to
do much.  It just flags the blitter task "Hey, there's something to update in
character row n, when you wake up".  The blitter task doesn't slow the emulator
down much, because it's mostly running on another processor, and it sleeps when
there's nothing to update.
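
A sketch in C of the write check and the "row n needs updating" flag.  The
addresses are assumptions based on the standard 48K Spectrum layout (display
file at $4000, attributes at $5800, video RAM ending at $5aff), and the names
and the dirty-row bitmask are invented for illustration; the post does not say
how the blitter task is actually flagged.

```c
#include <assert.h>
#include <stdint.h>

#define VIDEO_START 0x4000u   /* assumed: start of the display file     */
#define ATTR_START  0x5800u   /* assumed: start of the attribute bytes  */
#define VIDEO_END   0x5b00u   /* assumed: first address past video RAM  */

static uint32_t dirty_rows;   /* bit n set = character row n needs redrawing */

static int is_video(uint16_t addr)
{
    return addr >= VIDEO_START && addr < VIDEO_END;   /* the two tests */
}

static unsigned char_row(uint16_t addr)
{
    assert(is_video(addr));
    if (addr >= ATTR_START)
        return (addr - ATTR_START) >> 5;          /* 32 attributes per row */
    /* bitmap: bits 12-11 = screen third, bits 7-5 = row within the third */
    return ((addr >> 8) & 0x18u) | ((addr >> 5) & 0x07u);
}

/* Called on every z80 memory write; cheap when the address is not video. */
static void note_write(uint16_t addr)
{
    if (is_video(addr))
        dirty_rows |= (uint32_t)1 << char_row(addr);
}
```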

-------------------------------------------------------------------------------
Some notes on flag handling:

Both of the CP/M emulators I know about spend a lot of time handling z80 flags
(condition codes).  After just about every instruction they do a "move sr,d0"
or "move ccr,d0" or call GetCC() to get the 68000 flags, then they do a table
lookup to translate them to z80 format.  After every logical instruction (and,
or, xor etc), a second table lookup is done to set the z80 parity flag.  (The
68000 does not have a parity flag.)  These table lookups are slow.  In fact,
they often take several times as long as the guts of the instruction itself.

Both these table lookups are totally unnecessary!  It's faster to save the
flags in 68000 format (in a register).  Routines that test flags simply test
the corresponding 68000 flag.  For logical instructions, simply save the result
byte away somewhere, and set another bit in the register to say that parity
comes from the saved byte and not from the v flag.  The parity testing
instructions (e.g., "jp po,nn") look at that bit, and then test either the v
flag or the parity of the saved byte.
The only times you need to translate flags between z80 and 68000 formats are in
"push af" and "pop af" instructions.

I got a 10..20% speedup in my Spectrum emulator this way.
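
The deferred-parity scheme might look like this in C.  It is only a sketch
under assumptions: the CCR bit values follow the 68000 layout, USE_PARITY is a
hypothetical spare bit, and the struct and function names are invented here.

```c
#include <assert.h>
#include <stdint.h>

#define CCR_C      0x01u
#define CCR_V      0x02u
#define CCR_Z      0x04u
#define CCR_N      0x08u
#define USE_PARITY 0x80u   /* hypothetical bit: "use parity_byte, not V" */

struct z80_flags {
    uint8_t ccr;           /* flags kept in 68000 layout, plus USE_PARITY */
    uint8_t parity_byte;   /* result of the last logical instruction      */
};

/* After and/or/xor: just remember the result, no parity lookup. */
static void after_logical(struct z80_flags *f, uint8_t result)
{
    assert(f != 0);
    f->ccr = USE_PARITY;
    if (result == 0)   f->ccr |= CCR_Z;
    if (result & 0x80) f->ccr |= CCR_N;
    f->parity_byte = result;
}

/* After an 8-bit add: ordinary flags, V means signed overflow. */
static void after_add8(struct z80_flags *f, uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    uint8_t r = (uint8_t)sum;
    assert(f != 0);
    f->ccr = 0;
    if (r == 0)     f->ccr |= CCR_Z;
    if (r & 0x80)   f->ccr |= CCR_N;
    if (sum > 0xff) f->ccr |= CCR_C;
    if (~(a ^ b) & (a ^ r) & 0x80) f->ccr |= CCR_V;
}

/* Condition for "jp pe,nn" (z80 P/V set); parity is worked out only here. */
static int cond_pe(const struct z80_flags *f)
{
    assert(f != 0);
    if (f->ccr & USE_PARITY) {
        uint8_t p = f->parity_byte;
        p ^= p >> 4; p ^= p >> 2; p ^= p >> 1;
        return !(p & 1);               /* even number of 1 bits */
    }
    return (f->ccr & CCR_V) != 0;
}

/* Condition for "jp po,nn". */
static int cond_po(const struct z80_flags *f)
{
    return !cond_pe(f);
}
```

The parity work is paid only by the few instructions that test P/V, instead of
after every logical instruction.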

-------------------------------------------------------------------------------
I said in my previous post that threaded code for 8088 emulation would use too
much memory to be practical.  In fact it would be perfectly practical on an
Amiga equipped with 3 Mbytes or more.

Peter McGavin.   (srwmpnm@wnv.dsir.govt.nz)