Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg From: pcg@cs.aber.ac.uk (Piercarlo Grandi) Newsgroups: comp.arch Subject: Re: more registers for ix86, was: Let's pretend Message-ID: Date: 8 Jan 91 17:54:01 GMT References: <3042@crdos1.crd.ge.COM> <1990Dec26.020034.4131@lpi.liant.com> <5827@labtam.labtam.oz> <1991Jan6.014925.10935@zoo.toronto.edu> Sender: pcg@aber-cs.UUCP Organization: Coleg Prifysgol Cymru Lines: 94 Nntp-Posting-Host: odin In-reply-to: henry@zoo.toronto.edu's message of 6 Jan 91 01:49:25 GMT On 6 Jan 91 01:49:25 GMT, henry@zoo.toronto.edu (Henry Spencer) said: henry> The major alternatives to explicit register references in henry> instructions are memory-to-memory machines and stack machines henry> ("stack" here in the sense of expression-evaluation stack, not henry> call stack). Memory-to-memory machines lose big on the number of henry> bulky memory addresses needed when expressions start needing henry> temporaries. Stack machines lose big on the difficulty of henry> keeping useful values around for re-use, since the stack really henry> wants to throw a value away after using it. I'd like to repeat my usual fundamental truth on this: the latter statement need not be true, if one has multiple stacks. A little analysis will show why. Statistics exist that show that almost any expression and a majority of code sequences can, on a single threaded CPU (one with one ALU), be compiled without spills with FOUR spare registers[1]. To add abundant inter-expression caching we need another four. This amounts to saying that usually not more than four things try to happen independently in an expression. Statistics exist that show that the average expression is exceedingly simple[2], and frequently resused values are quite few as well. Now, a single stack architecture loses out badly when two conditions are true: 1) The lifetimes of the spills do not nest gracefully. 2) The CPU is not single threaded and has multiple ALUs. A single stack loses badly because in both[3] cases operations can only happen at the top of stack, and the separate logical (case 1) or physical (case 2) computation streams interfere and trash around it. The effect of case 1) is when you see a lot of seemingly unnecessary POP, PUSH and DUP operations used just to rearrange the elements around the top of stack, to bring each computation thread 'in focus' on it. This happens especially often with Fortran style codes, some of which do have complicated expressions and address/value stream interference. The obvious solution is to have multiple tops of stack. The CRISP architecture does this by allowing a short form of address to the first 32 slots on the stack; this seems to work quite well. My favourite solution would be to have multiple stacks. How many? FOUR is the answer, because it is exceedingly rare that an expression contains as many independent threads of computation. A superscalar machine may attach independent ALUs to each stack, or even specialize them[4]. In the old problem of providing accumulators and on-chip cache, the general purpose register file solution is to hide the accumulators and make only the on-chip cache visible (thus register renaming), the stack solution is to have a single accumulator and a stack cache behind it, per stack. Note that the stack solution has the benefit that it has better information encoding for the most common case, as demonstrated by the impressive code densities of stack based architectures[5]. ---------------- [1] Examples: a+b+c+d this, on a single threaded CPU, only needs one register to be compiled without spills. a/b + c*d this instead needs two registers for compiling without spilling, because two things are going on at the same time (logically). [2] Most expressions, staticall, are of the form 'a = b OP c' or simpler; dynamically things cna be slightly more complicated. Note that the operands may involve address computations. [3] Case 1) seems to be relatively rare but important; case 2) is more common than you think, because without being superscalar, many simple implementations have different ALUs for computing values and addresses. [4] Indeed in practice superscalar implementations find it exceedingly hard to find degrees of microparallelism greater than two or three, unless one is talking codes that are best suited to vector, SIMD or MIMD, architectures. Special purpose architectures for special purpose codes! [5] The reason is that most expressions are indeed stack based, i.e. the lifetimes of spills nest nicely, while register machines assume the worst case, i.e. all spills have completely unrelated lifetimes and have to be managed explicitly. A multiple stack machine is a nice compromise. -- Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk