Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!crdgw1!uunet!comp.vuw.ac.nz!windy!srwmpnm
From: srwmpnm@windy.dsir.govt.nz
Newsgroups: comp.sys.amiga.emulations
Subject: Re: CPU-emulators
Message-ID: <18878.2801a4ad@windy.dsir.govt.nz>
Date: 9 Apr 91 11:25:31 GMT
Organization: DSIR, Wellington, New Zealand
Lines: 71

Ilja Heitlager (iheitla@cs.vu.nl) wrote:
>I'm planning to write a 6502 (and maybe when I like it some others) emulator.

Good on you!  I've played around with the Z80 emulators for the Amiga, by Ulf
Nordquist and Charlie Gibbs, making them faster.  I have never touched 6502 but
the same techniques should apply.

>At this moment I think there are two ways of doing it:
>	1- Compare every Opcode and jump to a routine which executes the
>	   instruction
>	2- Do it more or less the way the microcode does it.
>	   Ok in software you can't do more operations at the same moment.

I found several more fundamentally different ways of doing it, and many
variations on those.  So far the fastest practical method seems to be threaded
code.  You can avoid decoding an opcode for every 6502 instruction altogether!
The emulation routine for each 6502 opcode ends with:

		move.l	(a3)+,a0
		jmp	(a0)

So each emulation routine jumps directly to the next emulation routine without
any decoding at all.  Register a3 acts as a "pseudo pc" into a 256 kbyte table
holding one longword pointer for each byte of the 64 kbyte 6502 address space
(64k x 4 bytes).  Each pointer is the address of the emulation routine for the
opcode at the corresponding 6502 address.
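
For anyone who would rather read C than 68000 code, here is a rough sketch of
the table layout described above.  It is an illustration only, not the code
from the Z80 emulators.  All names (Cpu, thread, op_lda_imm, op_stop and so on)
are made up for the example, and op_stop is a hypothetical routine used only to
end the run.  Standard C has no portable "jmp (a0)", so a small loop calls
through the table instead; that costs a call and return per instruction which
the assembler version avoids.  The table is filled in by hand here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_SIZE 65536u

typedef struct Cpu Cpu;
typedef void (*Handler)(Cpu *);

struct Cpu {
    uint8_t  mem[MEM_SIZE];     /* emulated 64 kbyte address space */
    Handler  thread[MEM_SIZE];  /* one routine pointer per emulated address */
    uint16_t pc;                /* index into thread[], plays the role of a3 */
    uint8_t  a;                 /* accumulator */
    int      running;
};

/* $A9, LDA #imm: two bytes long, so it steps pc past its operand. */
static void op_lda_imm(Cpu *c) {
    c->a  = c->mem[(uint16_t)(c->pc + 1)];
    c->pc = (uint16_t)(c->pc + 2);
}

/* $EA, NOP: one byte long. */
static void op_nop(Cpu *c) { c->pc = (uint16_t)(c->pc + 1); }

/* Hypothetical routine that ends the run (not a 6502 instruction). */
static void op_stop(Cpu *c) { c->running = 0; }

/* No opcode is examined here: the table entry already is the routine. */
static void run(Cpu *c, uint16_t start) {
    c->pc = start;
    c->running = 1;
    while (c->running) {
        Handler h = c->thread[c->pc];
        assert(h != NULL);      /* the demo only visits filled entries */
        h(c);
    }
}

int demo_threaded(void) {
    static Cpu c;               /* static: far too large for the stack */
    c.mem[0x0200] = 0xA9;       /* LDA #$2A */
    c.mem[0x0201] = 0x2A;
    c.mem[0x0202] = 0xEA;       /* NOP */
    c.thread[0x0200] = op_lda_imm;
    c.thread[0x0202] = op_nop;
    c.thread[0x0203] = op_stop;
    run(&c, 0x0200);
    return c.a;
}
```

Note that the entry for $0201 (the operand byte) is never used as long as
execution enters the instruction at $0200.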

Now, every time the 6502 writes to RAM, you need to update an entry in the
256 kbyte table.  At first it looks as if you have to do an instruction decode
to compute the new table value every time the 6502 writes to RAM.  But in fact
that is not necessary either!

What you do, when the 6502 writes to RAM, is to write a constant address into
the table.  That constant address points to a special routine called "patch".
When patch is called, you finally get to do an instruction decode.  Patch
computes the address of the routine for the current instruction, stuffs it
in the 256 kbyte table, then jumps to the routine for the current instruction.
Next time this instruction is executed, control bypasses patch and goes
directly to the right routine.
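
The patch idea can be sketched in C along the same lines.  Again this is only
an illustration under assumptions: the names are invented, decode() knows just
three opcodes and treats everything else as a hypothetical "stop", and the
decodes counter exists only so the demo can show when patch runs.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_SIZE 65536u

typedef struct Cpu Cpu;
typedef void (*Handler)(Cpu *);

struct Cpu {
    uint8_t  mem[MEM_SIZE];
    Handler  thread[MEM_SIZE];  /* one routine pointer per emulated address */
    uint16_t pc;
    uint8_t  a, x;
    int      running;
    int      decodes;           /* how often patch ran (demo only) */
};

static void op_lda_imm(Cpu *c) {            /* $A9, LDA #imm */
    c->a  = c->mem[(uint16_t)(c->pc + 1)];
    c->pc = (uint16_t)(c->pc + 2);
}
static void op_inx(Cpu *c)  { c->x++; c->pc = (uint16_t)(c->pc + 1); } /* $E8 */
static void op_nop(Cpu *c)  { c->pc = (uint16_t)(c->pc + 1); }         /* $EA */
static void op_stop(Cpu *c) { c->running = 0; }  /* hypothetical, ends run */

/* The only place an opcode is ever examined. */
static Handler decode(uint8_t opcode) {
    switch (opcode) {
    case 0xA9: return op_lda_imm;
    case 0xE8: return op_inx;
    case 0xEA: return op_nop;
    default:   return op_stop;  /* assumption: stop on anything unknown */
    }
}

/* Decode the current opcode, store its routine in the table, then run it. */
static void patch(Cpu *c) {
    Handler h = decode(c->mem[c->pc]);
    c->decodes++;
    c->thread[c->pc] = h;
    h(c);
}

/* Every emulated write just stores the constant "patch" in the table. */
static void write_mem(Cpu *c, uint16_t addr, uint8_t value) {
    c->mem[addr]    = value;
    c->thread[addr] = patch;
}

static void run(Cpu *c, uint16_t start) {
    c->pc = start;
    c->running = 1;
    while (c->running)
        c->thread[c->pc](c);
}

int demo_patch(void) {
    static Cpu c;
    static const uint8_t prog[] = { 0xA9, 0x05, 0xEA, 0x02 };
    c.a = 0; c.x = 0; c.decodes = 0;
    for (uint32_t i = 0; i < MEM_SIZE; i++)
        c.thread[i] = patch;
    for (uint16_t i = 0; i < sizeof prog; i++)
        write_mem(&c, (uint16_t)(0x0200 + i), prog[i]);

    run(&c, 0x0200);
    assert(c.a == 5 && c.decodes == 3);  /* first pass decodes 3 opcodes */
    run(&c, 0x0200);
    assert(c.decodes == 3);              /* second pass bypasses patch */
    write_mem(&c, 0x0202, 0xE8);         /* overwrite NOP with INX */
    run(&c, 0x0200);
    assert(c.x == 1);
    return c.decodes;                    /* only the rewritten byte redecoded */
}
```

The routines read their operands from emulated memory at run time, so a write
to an operand byte needs no special handling in this sketch.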

A variation of this method, which saves memory but is slightly slower, is to use
word offsets in a 128 kbyte table, instead of longword addresses in a 256 kbyte
table.  Each routine ends with:

		move.w	(a3)+,d0
		jmp	0(a2,d0.w)

where a2 holds the base from which all the routine offsets are computed.
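
A loose C analogue of the word-offset variant, offered as a sketch rather than
a translation.  C cannot portably take byte offsets into code, so the 16-bit
table entries here are indexes into a small array of routines, and that array
(routines[], a made-up name like the rest) stands in for the base held in a2.
In the 68000 version the index word in d0 is sign-extended, so the routines
have to sit within 32 kbyte either side of a2.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_SIZE 65536u

typedef struct Cpu Cpu;
typedef void (*Handler)(Cpu *);

struct Cpu {
    uint8_t  mem[MEM_SIZE];
    uint16_t thread[MEM_SIZE];  /* 16-bit entries: 128 kbyte instead of 256 */
    uint16_t pc;
    uint8_t  a;
    int      running;
};

static void op_stop(Cpu *c) { c->running = 0; }  /* hypothetical, ends run */
static void op_lda_imm(Cpu *c) {                 /* $A9, LDA #imm */
    c->a  = c->mem[(uint16_t)(c->pc + 1)];
    c->pc = (uint16_t)(c->pc + 2);
}
static void op_nop(Cpu *c) { c->pc = (uint16_t)(c->pc + 1); }  /* $EA */

/* Stand-in for the a2 base: table entries are resolved relative to this. */
enum { R_STOP, R_LDA_IMM, R_NOP, R_COUNT };
static const Handler routines[R_COUNT] = { op_stop, op_lda_imm, op_nop };

static void run(Cpu *c, uint16_t start) {
    c->pc = start;
    c->running = 1;
    while (c->running) {
        uint16_t entry = c->thread[c->pc];
        assert(entry < R_COUNT);
        routines[entry](c);     /* the extra step that jmp 0(a2,d0.w) pays */
    }
}

int demo_word_table(void) {
    static Cpu c;
    assert(sizeof c.thread == 2u * MEM_SIZE);   /* 128 kbyte table */
    c.mem[0x0200] = 0xA9;       /* LDA #$2A */
    c.mem[0x0201] = 0x2A;
    c.mem[0x0202] = 0xEA;       /* NOP */
    c.thread[0x0200] = R_LDA_IMM;
    c.thread[0x0202] = R_NOP;
    c.thread[0x0203] = R_STOP;
    run(&c, 0x0200);
    return c.a;
}
```

Because R_STOP is entry 0, any table slot that was never filled stops the run
in this sketch instead of jumping somewhere random.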

This method has more advantages:

1: To handle known ROM entry points, just point the vector for the entry point
at an optimised 68000 routine to do what the ROM routine does.  There is no
overhead at all in checking for ROM entry points.

2: To handle multiple-byte opcodes (e.g., prefix instructions), patch can be made
smart enough to point the vector for the prefix byte to the routine for the
entire instruction.  There is no need to decode opcodes after the prefix every
time the instruction is executed.

3: Patch can be made smart enough to recognise common sequences of 6502
instructions, and to point the vector at an optimised 68000 routine for the
whole sequence.

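Note that 2 and 3 above (if implemented) won't correctly emulate certain types
of self-modifying code.  For example, a write to a later byte of a combined
instruction or sequence resets only the vector for the byte written, so the
vector at the first byte still points at the old routine.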
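
Advantage 1 can be sketched in C as follows.  Everything machine-specific here
is an assumption made for the example: ROM_PUTCHAR at $FFD2 is a hypothetical
"print the character in A" entry point, and native_putchar is an invented
replacement that collects the characters in a buffer.  The replacement has to
finish the way the ROM routine would, with the effect of an RTS.  The table is
filled by hand to keep the sketch short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE    65536u
#define ROM_PUTCHAR 0xFFD2u     /* hypothetical ROM entry point */

typedef struct Cpu Cpu;
typedef void (*Handler)(Cpu *);

struct Cpu {
    uint8_t  mem[MEM_SIZE];
    Handler  thread[MEM_SIZE];  /* one routine pointer per emulated address */
    uint16_t pc;
    uint8_t  a, sp;
    int      running;
    char     out[16];           /* what the "ROM" printed (demo only) */
    unsigned out_len;
};

/* The 6502 stack lives in page 1 and grows downwards. */
static void    push(Cpu *c, uint8_t v) { c->mem[0x0100u + c->sp] = v; c->sp--; }
static uint8_t pull(Cpu *c)            { c->sp++; return c->mem[0x0100u + c->sp]; }

static void op_lda_imm(Cpu *c) {        /* $A9, LDA #imm */
    c->a  = c->mem[(uint16_t)(c->pc + 1)];
    c->pc = (uint16_t)(c->pc + 2);
}
static void op_stop(Cpu *c) { c->running = 0; }  /* hypothetical, ends run */

/* $20, JSR abs: pushes the address of its own last byte, high byte first. */
static void op_jsr(Cpu *c) {
    uint16_t lo  = c->mem[(uint16_t)(c->pc + 1)];
    uint16_t hi  = c->mem[(uint16_t)(c->pc + 2)];
    uint16_t ret = (uint16_t)(c->pc + 2);
    push(c, (uint8_t)(ret >> 8));
    push(c, (uint8_t)(ret & 0xFF));
    c->pc = (uint16_t)((hi << 8) | lo);
}

/* Native stand-in for the ROM routine: do the work, then act like RTS. */
static void native_putchar(Cpu *c) {
    if (c->out_len + 1 < sizeof c->out)
        c->out[c->out_len++] = (char)c->a;
    uint16_t lo = pull(c);
    uint16_t hi = pull(c);
    c->pc = (uint16_t)(((hi << 8) | lo) + 1);
}

static void run(Cpu *c, uint16_t start) {
    c->pc = start;
    c->running = 1;
    while (c->running)
        c->thread[c->pc](c);
}

const char *demo_rom_hook(void) {
    static Cpu c;
    static const uint8_t prog[] = {
        0xA9, 'H', 0x20, 0xD2, 0xFF,    /* LDA #'H' : JSR $FFD2 */
        0xA9, 'i', 0x20, 0xD2, 0xFF     /* LDA #'i' : JSR $FFD2 */
    };
    memcpy(&c.mem[0x0200], prog, sizeof prog);
    memset(c.out, 0, sizeof c.out);
    c.out_len = 0;
    c.sp = 0xFF;
    c.thread[0x0200] = op_lda_imm;
    c.thread[0x0202] = op_jsr;
    c.thread[0x0205] = op_lda_imm;
    c.thread[0x0207] = op_jsr;
    c.thread[0x020A] = op_stop;
    c.thread[ROM_PUTCHAR] = native_putchar;  /* the hook: one table entry */
    run(&c, 0x0200);
    return c.out;
}
```

The main loop never tests whether pc has reached a ROM entry point; the hook
is just another entry in the table.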

There was a good article on "Portable Fast Direct Threaded Code" by Eliot
Miranda in comp.compilers recently.  He uses GCC to write "machine independent"
threaded code that is just about as efficient as my 68000-specific code.

Hope this helps.  Regards, Peter McGavin.   (srwmpnm@wnv.dsir.govt.nz)