Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!think.com!mintaka!spdcc!iecc!compilers-sender
From: pardo@cs.washington.edu (David Keppel)
Newsgroups: comp.compilers
Subject: Re: Portable Fast Direct Threaded Code
Keywords: interpreter, performance, design
Message-ID: <1991Apr4.171055.5790@beaver.cs.washington.edu>
Date: 4 Apr 91 17:10:55 GMT
References: <3035@redstar.cs.qmw.ac.uk> <1991Apr2.014216.25150@watson.ibm.com> <1991Apr2.192125.7464@beaver.cs.washington.edu> <1991Apr3.182334.16164@src.honeywell.com>
Sender: compilers-sender@iecc.cambridge.ma.us
Reply-To: pardo@cs.washington.edu (David Keppel)
Organization: Computer Science & Engineering, U. of Washington, Seattle
Lines: 117
Approved: compilers@iecc.cambridge.ma.us

>>>[Threaded code vs. compilation]

>pardo@cs.washington.edu (David Keppel) writes:
>>[Less than 2:1 performance hit of threading vs. full compilation.]

Note also that the reference that claimed 2:1 (Peter M. Kogge, IEEE
Computer pp 22-33 March 1982) also attributed part of that ratio to the
poor state of compiler optimization.


vestal@SRC.Honeywell.COM (Steve Vestal) writes:
>[Klint says 2:1 for PDP-11 v. 9:1 for Cyber.
> How architecturally dependent are these techniques?]

Suppose that the statically-compiled code fragments that are threaded
together are called `primitives'.


When the execution time of a primitive is large, then the overhead for the
interpreter can be large and still have a small effect on performance.
The performance of the interpreted code is dominated by the time in a
primitive vs. the overhead of moving between primitives.

When the execution time of the primitives is small, then the overhead for
moving between primitives can be a large fraction of the total execution
time.  Overhead comes from at least two sources:

 * Control flow: the address of the the next primitive is loaded
   from data memory and the processor executes an indirect jump.

 * Register allocation: a primitive is essentially a function with
   a fast calling convention (no stack adjustment).  Thus, all the
   traditional problems with interprocedural register allocation.

Examples of `large primitives' are ``draw circle'' and ``interpolate
spline''.  Examplees of small primitives are ``push'', ``add'', etc.


* Architectural dependency of control flow

Examples:

	Direct jumps in full compilation:

		op1
		op2
		br next		// direct jump
	
	Indirect jumps for threading for a CISC:

		op1
		op2
		br *(r0)+	// load, increment, jump

	Indirect jumps for threading for a RISC:

		ld *r0, r1	// scheduled load
		op1
		op2
		br *r1		// jump
		add r1, #4, r1	// delay slot increment

Direct jumps in full compilation can frequently use one instruction (a
``near branch'') both to find the address of the next code fragment and
perform the control transfer.  On a CISC, branches are typically two or
three bytes.  On a RISC, branches are typically four bytes.  The threaded
indirect (load, increment, jump) is typically three bytes on a CISC and
twelve bytes (three instructions) on a RISC.

Direct jumps in full compilation take typically one instruction time.
Indirect jumps take at least the following operations: load, increment,
jump indirect.  On a CISC, the three operations can typically be `folded'
in to one instruction.  There may be a load penalty of one instruction
time but the increment is overlapped, so the total time is three machine
units (one `unit' is about one register->register operation).  On a RISC,
the total penalty is three machine units.

Direct jumps take one (I-cache) cycle to fetch both the branch instruction
and the address of the branch target.  Indirect jumps take a D-cache cycle
to fetch the address of the branch target and an I-cache cycle to fetch
the branch instruction.

Direct jumps can take advantage of instruction prefetching since the
address of the next instruction is known at the time that the instruction
prefetch reads the direct jump.  Threaded indirects require an indirect
branch off of a register.  Current RISC and CISC machines are about
equivalent in that they do little prefetching.  Some machines being
designed do more prefetching so the threading overhead for them will be
greater.


* Architectural dependency of register allocation

In a machine with a small number of registers, many of the registers are
in-use in each primitive and the best possible register allocation will
contain many loads and stores.  In a machine with a large number of
registers, the full-compilation implementation can make much better use of
registers than the threaded primitives implementation (again, for small
primitives).  The savings from full compilation are exactly analagous to
the improvements in register allocation from doing inlining of small
procedures.


* Other points to ponder

In some cases the threaded code implementation is substantially smaller
than the full-compilation implementation.  For large functions or a
machine with small caches, the loss of performance from threading might be
overwhelmed by the gain in cache performance.

On RISC machines, procedure call/return is about twice the cost of other
control flow, except for the overhead of register management.  Thus,
call-dense RISC code from full compilation may behave about the same as
threaded code.
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.