Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!think.com!mintaka!spdcc!iecc!compilers-sender
From: firth@sei.cmu.edu (Robert Firth)
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, performance, design
Message-ID: <23613@as0c.sei.cmu.edu>
Date: 4 Apr 91 13:27:21 GMT
References: <3035@redstar.cs.qmw.ac.uk> <1991Apr2.014216.25150@watson.ibm.com> <1991Apr2.192125.7464@beaver.cs.washington.edu> <1991Apr3.182334.16164@src.honeywell.com>
Sender: compilers-sender@iecc.cambridge.ma.us
Reply-To: firth@sei.cmu.edu (Robert Firth)
Organization: Software Engineering Institute, Pittsburgh, PA
Lines: 129
Approved: compilers@iecc.cambridge.ma.us

In article <1991Apr3.182334.16164@src.honeywell.com> vestal@SRC.Honeywell.COM (Steve Vestal) writes:

>How architecturally dependent is the performance of these techniques
>(relative to compiling to native code)?

[cost of threaded code on PDP-11, RISC &c]

We might have a misunderstanding here, because what I think of as threaded
code doesn't have a decoding and interpretation step.  But I'll talk of
what I know.

A program in threaded code is just an array of addresses, possibly
interspersed with operands.  So the fragment

	c := a + b

becomes something like

	address of 'load'
	address of 'a'
	address of 'load'
	address of 'b'
	address of '+'
	address of 'store'
	address of 'c'

This implies a very simple virtual stack machine - you can get more clever
by implementing a virtual register machine.

The basic execution thread then does this.  We point a global register at
the table of addresses, and each primitive has the form

	treg := treg + address'size
	...
	jump (treg)

As you can see, this is great on the PDP-11, since that reduces to one
instruction

	MOV (treg)+,PC 	; NOTE TO MAINTAINER: FASTER THAN JMP - DON'T TOUCH!

On a typical RISC machine, it's two cycles, since you can put almost
anything in the delay slot(s) after the jump.

Now, the load instruction, for instance, would say

load:	treg := treg + address'size
	load (treg) into tempreg
	treg := treg + address'size
	push (tempreg) onto simulated stack
	jump (treg)

On the PDP-11, that's

	MOV @(treg)+, -(SP)
	MOV (treg)+, PC

On a RISC, it's much more like

	L R0, 4(treg)		; get operand address
	L R0, 0(R0)		; dereference to get operand
	SUBI SP, #4		; decrement simulated SP
	S R0, 0(SP)		; push operand on stack
	ADDI treg, #8		; step over two addresses (mine & operands)
	JR (treg)		; over to you, Moriarty!

[to fill delay slots, shuffle the above to 132564]

Well, if you have one load delay slot and one branch delay slot, you can
fill all three of them, so that's 6 cycles.  Given that a typical load is
only 1.1 cycles in direct code (90% of the delays filled), this is
certainly a lot more than a 2:1 overhead!  When you add the cost of a
simulated stack (lots of needless loads and stores), I can well believe
10:1 time expansion for simple code.

In fact, it was more than that on the PDP-11, if you compared threaded
code with direct code from a decent compiler.  The big win in the Fortran
compiler came from (a) very compact threaded code, and (b) the floating
point operations were implemented in software, so the overhead of threaded
code was swamped by the cost of floating addition, subtraction &c.

Here's the full code of the above example, so you can see for yourself

Direct:
	MOV a, R0
	ADD b, R0
	MOV R0, c

Threaded:
	MOV @(treg)+, -(SP)
	MOV (treg)+, PC
*	MOV @(treg)+, -(SP)
*	MOV (treg)+, PC
*	ADD (SP)+,(SP)
	MOV (treg)+, PC
	MOV (SP)+, @(treg)+
	MOV (treg)+, PC

Note that, if you implement a one-address add, you save two instructions,
since the *** bit reduces to

	ADD @(treg)+, (SP)

But even then, it's coming out at over 4:1.

What architectural features make threaded code more efficient?  The
fundamental one is main memory that is fast (or not too slow) relative to
registers, since you're doing a lot more fetching.  Another is a set of
address modes with double indirection, since you're accessing most
operands one level of indirection further back.  And good old
autoincrement helps a little, too.  Alas, none of that says 'risc', and
much of it says '1960s'.

Incidentally, if I were to do this again today, I'd definitely simulate a
general-register machine and use a subset of the real machine's registers.
If you take 8 of them, you then have 8 loads and stores, one for each
register, but if you make an absolute rule that nobody even thinks about
touching one of those 8 that doesn't belong to him, then all the good
tricks about register allocation, slaving &c will still work.  If you then
implement the operations as one-address general-register, you have again 8
versions (add into R0, add into R1, ...) and lo! you're programming a very
familiar old friend.

"But that was in another country, and besides, the machine is dead..."

Robert Firth
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.