Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!think.com!mintaka!spdcc!iecc!compilers-sender From: firth@sei.cmu.edu (Robert Firth) Newsgroups: comp.compilers Subject: Re: Portable Fast Direct Threaded Code Keywords: interpreter, performance, design Message-ID: <23613@as0c.sei.cmu.edu> Date: 4 Apr 91 13:27:21 GMT References: <3035@redstar.cs.qmw.ac.uk> <1991Apr2.014216.25150@watson.ibm.com> <1991Apr2.192125.7464@beaver.cs.washington.edu> <1991Apr3.182334.16164@src.honeywell.com> Sender: compilers-sender@iecc.cambridge.ma.us Reply-To: firth@sei.cmu.edu (Robert Firth) Organization: Software Engineering Institute, Pittsburgh, PA Lines: 129 Approved: compilers@iecc.cambridge.ma.us In article <1991Apr3.182334.16164@src.honeywell.com> vestal@SRC.Honeywell.COM (Steve Vestal) writes: >How architecturally dependent is the performance of these techniques >(relative to compiling to native code)? [cost of threaded code on PDP-11, RISC &c] We might have a misunderstanding here, because what I think of as threaded code doesn't have a decoding and interpretation step. But I'll talk of what I know. A program in threaded code is just an array of addresses, possibly interspersed with operands. So the fragment c := a + b becomes something like address of 'load' address of 'a' address of 'load' address of 'b' address of '+' address of 'store' address of 'c' This implies a very simple virtual stack machine - you can get more clever by implementing a virtual register machine. The basic execution thread then does this. We point a global register at the table of addresses, and each primitive has the form treg := treg + address'size ... jump (treg) As you can see, this is great on the PDP-11, since that reduces to one instruction MOV (treg)+,PC ; NOTE TO MAINTAINER: FASTER THAN JMP - DON'T TOUCH! On a typical RISC machine, it's two cycles, since you can put almost anything in the delay slot(s) after the jump. Now, the load instruction, for instance, would say load: treg := treg + address'size load (treg) into tempreg treg := treg + address'size push (tempreg) onto simulated stack jump (treg) On the PDP-11, that's MOV @(treg)+, -(SP) MOV (treg)+, PC On a RISC, it's much more like L R0, 4(treg) ; get operand address L R0, 0(R0) ; dereference to get operand SUBI SP, #4 ; decrement simulated SP S R0, 0(SP) ; push operand on stack ADDI treg, #8 ; step over two addresses (mine & operands) JR (treg) ; over to you, Moriarty! [to fill delay slots, shuffle the above to 132564] Well, if you have one load delay slot and one branch delay slot, you can fill all three of them, so that's 6 cycles. Given that a typical load is only 1.1 cycles in direct code (90% of the delays filled), this is certainly a lot more than a 2:1 overhead! When you add the cost of a simulated stack (lots of needless loads and stores), I can well believe 10:1 time expansion for simple code. In fact, it was more than that on the PDP-11, if you compared threaded code with direct code from a decent compiler. The big win in the Fortran compiler came from (a) very compact threaded code, and (b) the floating point operations were implemented in software, so the overhead of threaded code was swamped by the cost of floating addition, subtraction &c. Here's the full code of the above example, so you can see for yourself Direct: MOV a, R0 ADD b, R0 MOV R0, c Threaded: MOV @(treg)+, -(SP) MOV (treg)+, PC * MOV @(treg)+, -(SP) * MOV (treg)+, PC * ADD (SP)+,(SP) MOV (treg)+, PC MOV (SP)+, @(treg)+ MOV (treg)+, PC Note that, if you implement a one-address add, you save two instructions, since the *** bit reduces to ADD @(treg)+, (SP) But even then, it's coming out at over 4:1. What architectural features make threaded code more efficient? The fundamental one is main memory that is fast (or not too slow) relative to registers, since you're doing a lot more fetching. Another is a set of address modes with double indirection, since you're accessing most operands one level of indirection further back. And good old autoincrement helps a little, too. Alas, none of that says 'risc', and much of it says '1960s'. Incidentally, if I were to do this again today, I'd definitely simulate a general-register machine and use a subset of the real machine's registers. If you take 8 of them, you then have 8 loads and stores, one for each register, but if you make an absolute rule that nobody even thinks about touching one of those 8 that doesn't belong to him, then all the good tricks about register allocation, slaving &c will still work. If you then implement the operations as one-address general-register, you have again 8 versions (add into R0, add into R1, ...) and lo! you're programming a very familiar old friend. "But that was in another country, and besides, the machine is dead..." Robert Firth -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.