Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: more registers for ix86, was: Let's pretend
Message-ID: <PCG.91Jan8175401@odin.cs.aber.ac.uk>
Date: 8 Jan 91 17:54:01 GMT
References: <3042@crdos1.crd.ge.COM> <1990Dec26.020034.4131@lpi.liant.com>
	<5827@labtam.labtam.oz> <KENW.91Jan3123808@skyler.calarc.ARC.AB.CA>
	<1991Jan6.014925.10935@zoo.toronto.edu>
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 94
Nntp-Posting-Host: odin
In-reply-to: henry@zoo.toronto.edu's message of 6 Jan 91 01:49:25 GMT

On 6 Jan 91 01:49:25 GMT, henry@zoo.toronto.edu (Henry Spencer) said:

henry> The major alternatives to explicit register references in
henry> instructions are memory-to-memory machines and stack machines
henry> ("stack" here in the sense of expression-evaluation stack, not
henry> call stack).  Memory-to-memory machines lose big on the number of
henry> bulky memory addresses needed when expressions start needing
henry> temporaries.  Stack machines lose big on the difficulty of
henry> keeping useful values around for re-use, since the stack really
henry> wants to throw a value away after using it.

I'd like to repeat my usual fundamental truth on this: the latter
statement need not be true, if one has multiple stacks. A little
analysis will show why. Statistics exist that show that almost any
expression and a majority of code sequences can, on a single threaded
CPU (one with one ALU), be compiled without spills with FOUR spare
registers[1].  To add abundant inter-expression caching we need another
four.

This amounts to saying that usually not more than four things try to
happen independently in an expression. Statistics exist that show that
the average expression is exceedingly simple[2], and frequently resused
values are quite few as well.

Now, a single stack architecture loses out badly when two conditions are
true:

1) The lifetimes of the spills do not nest gracefully.

2) The CPU is not single threaded and has multiple ALUs.

A single stack loses badly because in both[3] cases operations can only
happen at the top of stack, and the separate logical (case 1) or
physical (case 2) computation streams interfere and trash around it.

The effect of case 1) is when you see a lot of seemingly unnecessary
POP, PUSH and DUP operations used just to rearrange the elements around
the top of stack, to bring each computation thread 'in focus' on it.
This happens especially often with Fortran style codes, some of which do
have complicated expressions and address/value stream interference.

The obvious solution is to have multiple tops of stack. The CRISP
architecture does this by allowing a short form of address to the first
32 slots on the stack; this seems to work quite well.

My favourite solution would be to have multiple stacks. How many? FOUR
is the answer, because it is exceedingly rare that an expression
contains as many independent threads of computation. A superscalar
machine may attach independent ALUs to each stack, or even specialize
them[4].

In the old problem of providing accumulators and on-chip cache, the
general purpose register file solution is to hide the accumulators and
make only the on-chip cache visible (thus register renaming), the stack
solution is to have a single accumulator and a stack cache behind it,
per stack.

Note that the stack solution has the benefit that it has better
information encoding for the most common case, as demonstrated by the
impressive code densities of stack based architectures[5].

----------------

[1] Examples:

a+b+c+d		this, on a single threaded CPU, only needs one register
		to be compiled without spills.

a/b + c*d	this instead needs two registers for compiling without
		spilling, because two things are going on at the same
		time (logically).

[2] Most expressions, staticall, are of the form 'a = b OP c' or
simpler; dynamically things cna be slightly more complicated. Note that
the operands may involve address computations.

[3] Case 1) seems to be relatively rare but important; case 2) is more
common than you think, because without being superscalar, many simple
implementations have different ALUs for computing values and addresses.

[4] Indeed in practice superscalar implementations find it exceedingly
hard to find degrees of microparallelism greater than two or three,
unless one is talking codes that are best suited to vector, SIMD or
MIMD, architectures. Special purpose architectures for special purpose
codes!

[5] The reason is that most expressions are indeed stack based, i.e. the
lifetimes of spills nest nicely, while register machines assume the
worst case, i.e. all spills have completely unrelated lifetimes and have
to be managed explicitly. A multiple stack machine is a nice compromise.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk