Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!think.com!mintaka!spdcc!iecc!compilers-sender From: pardo@cs.washington.edu (David Keppel) Newsgroups: comp.compilers Subject: Re: Portable Fast Direct Threaded Code Keywords: interpreter, performance, design Message-ID: <1991Apr4.171055.5790@beaver.cs.washington.edu> Date: 4 Apr 91 17:10:55 GMT References: <3035@redstar.cs.qmw.ac.uk> <1991Apr2.014216.25150@watson.ibm.com> <1991Apr2.192125.7464@beaver.cs.washington.edu> <1991Apr3.182334.16164@src.honeywell.com> Sender: compilers-sender@iecc.cambridge.ma.us Reply-To: pardo@cs.washington.edu (David Keppel) Organization: Computer Science & Engineering, U. of Washington, Seattle Lines: 117 Approved: compilers@iecc.cambridge.ma.us >>>[Threaded code vs. compilation] >pardo@cs.washington.edu (David Keppel) writes: >>[Less than 2:1 performance hit of threading vs. full compilation.] Note also that the reference that claimed 2:1 (Peter M. Kogge, IEEE Computer pp 22-33 March 1982) also attributed part of that ratio to the poor state of compiler optimization. vestal@SRC.Honeywell.COM (Steve Vestal) writes: >[Klint says 2:1 for PDP-11 v. 9:1 for Cyber. > How architecturally dependent are these techniques?] Suppose that the statically-compiled code fragments that are threaded together are called `primitives'. When the execution time of a primitive is large, then the overhead for the interpreter can be large and still have a small effect on performance. The performance of the interpreted code is dominated by the time in a primitive vs. the overhead of moving between primitives. When the execution time of the primitives is small, then the overhead for moving between primitives can be a large fraction of the total execution time. Overhead comes from at least two sources: * Control flow: the address of the the next primitive is loaded from data memory and the processor executes an indirect jump. * Register allocation: a primitive is essentially a function with a fast calling convention (no stack adjustment). Thus, all the traditional problems with interprocedural register allocation. Examples of `large primitives' are ``draw circle'' and ``interpolate spline''. Examplees of small primitives are ``push'', ``add'', etc. * Architectural dependency of control flow Examples: Direct jumps in full compilation: op1 op2 br next // direct jump Indirect jumps for threading for a CISC: op1 op2 br *(r0)+ // load, increment, jump Indirect jumps for threading for a RISC: ld *r0, r1 // scheduled load op1 op2 br *r1 // jump add r1, #4, r1 // delay slot increment Direct jumps in full compilation can frequently use one instruction (a ``near branch'') both to find the address of the next code fragment and perform the control transfer. On a CISC, branches are typically two or three bytes. On a RISC, branches are typically four bytes. The threaded indirect (load, increment, jump) is typically three bytes on a CISC and twelve bytes (three instructions) on a RISC. Direct jumps in full compilation take typically one instruction time. Indirect jumps take at least the following operations: load, increment, jump indirect. On a CISC, the three operations can typically be `folded' in to one instruction. There may be a load penalty of one instruction time but the increment is overlapped, so the total time is three machine units (one `unit' is about one register->register operation). On a RISC, the total penalty is three machine units. Direct jumps take one (I-cache) cycle to fetch both the branch instruction and the address of the branch target. Indirect jumps take a D-cache cycle to fetch the address of the branch target and an I-cache cycle to fetch the branch instruction. Direct jumps can take advantage of instruction prefetching since the address of the next instruction is known at the time that the instruction prefetch reads the direct jump. Threaded indirects require an indirect branch off of a register. Current RISC and CISC machines are about equivalent in that they do little prefetching. Some machines being designed do more prefetching so the threading overhead for them will be greater. * Architectural dependency of register allocation In a machine with a small number of registers, many of the registers are in-use in each primitive and the best possible register allocation will contain many loads and stores. In a machine with a large number of registers, the full-compilation implementation can make much better use of registers than the threaded primitives implementation (again, for small primitives). The savings from full compilation are exactly analagous to the improvements in register allocation from doing inlining of small procedures. * Other points to ponder In some cases the threaded code implementation is substantially smaller than the full-compilation implementation. For large functions or a machine with small caches, the loss of performance from threading might be overwhelmed by the gain in cache performance. On RISC machines, procedure call/return is about twice the cost of other control flow, except for the overhead of register management. Thus, call-dense RISC code from full compilation may behave about the same as threaded code. -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.