Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!usc!samsung!spool2.mu.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: Register Count
Summary: The Truth about "optimizers", for the millionth time.
	 More interestingly, a treatise on value dataflow patterns,
	 and how they map on common architecture classes.
Message-ID: <PCG.91Jan10162301@odin.cs.aber.ac.uk>
Date: 10 Jan 91 16:23:01 GMT
References: <PCG.91Jan8175401@odin.cs.aber.ac.uk> <11538@pt.cs.cmu.edu>
Sender: cho@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 296
Nntp-Posting-Host: odin
In-reply-to: lindsay@gandalf.cs.cmu.edu's message of 9 Jan 91 05:45:14 GMT

On 9 Jan 91 05:45:14 GMT, lindsay@gandalf.cs.cmu.edu (Donald Lindsay)
said:

lindsay> In article <PCG.91Jan8175401@odin.cs.aber.ac.uk>
lindsay> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pcg> Statistics exist that show that almost any expression and a
pcg> majority of code sequences can, on a single threaded CPU (one with
pcg> one ALU), be compiled without spills with FOUR spare registers[1].
pcg> To add abundant inter-expression caching we need another four.

lindsay> This directly contradicts recent experience with optimizing
lindsay> compilers.

This is directly supported by all experience with optimizing compilers.
It is an old controversy, but let me repeat here: 90% of what passes for
"optimizing" compilers are compilers that optimize for *space*, not
time, by caching into registers things that are frequently referenced in
the program text, by minimizing spills, not those used in the program
execution. Since on most machines, especially RISC machines, referring
to an operand in main memory takes much more space than if it were in a
register, these compilers stash into registers eveything in sight, and
thus they do achieve code space compaction.

Since most of these compilers make no attempt to estimate the dynamic
frequency of use of values, they achieve good un time performance only
if *all* values are put in registers; as far as time is concerned,
whether most values are in a register or a spilled to memory is
irrelevant.  Fortunately since it is extremely rare[1] that there
are more than a couple dozen values in any one section of code, that all
values are cached can usually be ensured by having a 32 strong register
bank.

When you run a real _time_ optimizing compiler[2] the knee of the curve,
for single threaded non float computations, lies at four spare registers
and four global[3] registers.

lindsay> You appear to have counted floating point values (and assumed
lindsay> no unrolling or software pipelining) and taken that as the
lindsay> contents of the register bank.

I have said *nonfloat* and *single threaded*. In my view of the world,
each functional units needs its own set of registers. Most of the
currently available implementations are nearly single threaded, or in
any case need only few registers[3].  Loop unrolling only wins if you
have some degree of multiple functional units, otherwise you only buy
reduced overhead, which is not much, especially if the implementation
has some degree of pipelining.

lindsay> Aside from the generated-in address computations, there are
lindsay> uses such as scope uplinks, subroutine parameter lists, loop
lindsay> control, and the like. And those are just the conventional
lindsay> uses, not the agressive ones.

I am getting tired repeating that that quantitative analysis of computer
architecture, which existed well before Hennessy & Patterson's book, is
about knees in the two curves, resource/benef	it and resource/cost[4].

My current understanding, based on vast amounts of reading[5] and
analysis, is that, for general purpose applications, the number of
registers is an increasing function of the number of functional units,
that the knee of the curves is at four spare registers (plus possibly
four global ones) for each one, and that the knee of the curve in
functional units is at four as well, as repeated here:

pcg> My favourite solution would be to have multiple stacks. How many? FOUR
pcg> is the answer, because it is exceedingly rare that an expression
pcg> contains as many independent threads of computation. A superscalar
pcg> machine may attach independent ALUs to each stack, or even specialize
pcg> them[4].
pcg> [4] Indeed in practice superscalar implementations find it exceedingly
pcg> hard to find degrees of microparallelism greater than two or three,

I have had comments here and in the past that demonstrate a fundamental
difference in point of view between me and some other people. I base my
numbers on the computation patterns of general purpose applications --
let's say on the data flows of values in them. Characterizing general
purpose applications is far from a static field, but by now we know
quite a lot.

How to map these onto an application is the architect's problem. Some
people hold that the machine should be organized in the most dynamic
fashion to mimic the dataflow of values in any application, and come up
with superscalar dataflow machines. Based on my armchair evidence I
believe that most general purpose applications data flows can be
modelled best with four multiple stacks (quite shallow, by the way, say
four deep), because such dataflows have the shape of a tree with a
relatively low branching factor, and relatively small number of levels
in each independent branch.

If this is true, adding more functional units is not going to be of
benefit, because interference between the dataflows will be high enouggh
to cause too much trouble, and having deeper stack windows is not going
to be terribly useful either, because the stack depth will only rarely
cause its overflow, and that will happen infrequently.

lindsay> That's two or three things _per-clock_, on a _pipelined_
lindsay> machine. You are planning to operate more than one stack per
lindsay> clock?  In a pipelined fashion, so that each stack can be
lindsay> re-referenced while an outstanding computation hasn't yet
lindsay> written back its result?

Why not? Once you have multiple stacks, if the compiler is not demented,
and the application does have a sufficient intrinsic degree of
microparallelism, dataflows on the different stacks will be typically
independent. For example "(a+b) * (c+d)" can easily be mapped to two
stacks to be run in parallel. Independent branches of the dataflow
tree/graph are prime candidates. Just look at programs and data as
though they were written in Lisp :->.

lindsay> How are the multiple consequences resolved?

These are problems that all superscalar architectures must solve.

Architectures with a flat register file model, I believe, end up
simulating multiple stacks in the flat register file, whether they are
superscalar or not, because that is how application data flows are.

Now, _if_ we accept my belief (but I am ready to be converted!) that
general purpose applications dataflows can be characterized as above (a
few, short, independent "nests" of activity), we have the following
alternatives:

	Scalar, register file, load/store

Multiple stacklets are simulated in the register file. Switching the
focus among the stacklets is instantaneous, because action can happen
between any two registers, and any register can be "stack top".
Processing of each thread, that is of each stacklet, is serial. It is
important that the register file be able to cache entire stacklets,
because referring to operands in memory is so awkward. Poor instruction
density, because it does not exploit exploit expected dataflow patterns,
and is designed for the worst case[7].

	Scalar, register file, traditional

Multiple stacklets are simulated on a memory stack, the stacletk tops
(and as many further down as will fit) are kept in CPU registers, the
rest (the "spills") are multiplexed on the memory stack. Switching the
focus among the stacklets is nearly instantaneous, because most of the
action happens anyhow near the stack tops, which are in registers.
Accessing the bottom elements each stacklet is not too expensive, in a
register/memory architecture.  Processing of each thread is serial. Fair
instruction density, because it is fairly close to the multiple nested
value dataflow patterns.

	Scalar, single stack

Multiple stacklets are nested on a single CPU stack. Switching the focus
among the stacks requires pushing/popping entire substacks because
action can happens only at the stack top. Independent threads of action
cause problems because the real stack top must be multiplexed
unpleasantly.  Good instruction density, because it exploits very well
the nested nature of most computation flows[8]. Unfortunately encoding
density is not perfect, because to switch the focus from one stacklet to
another takes a number of push and pop operations. Fortunately many non
numeric applications do not have more than one thread of computation.

	Scalar, multiple stack

There is a small number of CPU stacks, of limited depth, with
overflow/underflow spilling/refilling, either hardware or software
assisted. Each stacklet is mapped to a different hardware stack, and
carries an independent thread of the computation. Typically only two CPU
stacks will be in frequent use[9], even if four will usually be
provided. Four CPU stacks will enable mapping each stacklet for most
applications. A depth of four for each CPU stack will result in no or
rare overflows most of the time.  Switching the focus between stacklets
is instantaneous.  Encoding density is excellent, because it maps
exactly on the expected value dataflow patterns.

	Superscalar, registers, load/store

There are multiple focus points, because each stacklet contained in the
register file can be served by a different functional unit. If[10] the
register file is multiported, access to the independent stacklets is
fast, but a lot of interconnect is required, because the stacklets may
be spread all around and intermingle arbitrarily in the register file,
again as there is too much generality and worst case assumptions. If
there are more functional units than there is space in the register file
for as many stacklets, trouble ensues, unless exoterica are used like
register renaming[11].

	Superscalar, registers, traditional
	Superscalar, single stack

They can be made to work, at the price of converting on the fly the
instruction stream to that for another type of architecture[12]; the
difficulty is that the presence of multiple focuses because of multiple
functional units makes for trouble, as both have to multiplex multiple
stacklets onto a single stack. Since "registers, traditional" only uses
the memory stack for multiplexing the bottoms of the stacklets, it can
be adapted to superscalar operations better than "single stack"[13].

	Superscalar, multiple stack

There are multiple CPU stacks, and each computation stacklet is mapped
onto it.  Each CPU stack need not be multiported; it can even be
dedicated to a single functional unit, even if I wouldn't. Only the
stack tops can be transferred to/from other stacks. Computations in each
CPU stack can proceed independently, because this is how application
data flows are structured. The push and pop operations could have two
forms, a short one for addressing directly the on chip cache (which
would become a statically managed cache, something similar to a register
file, but not quite), or the cache could be a more conventional one, and
push and pop have only full address forms. More functional units that
there are hardware stacks is a bad idea[11].

	Superscalar, dataflow

In this organization the CPU is structured internally as a network of
functional units, and values (and operators) flow in this network.
Unless the functional unit mix required by the application is very
different by that provided in the network, this will dynamically adapt
to any value dataflow patterns exhibited by the application, as well as
to any number and type of functional units, without recompilation.
Unfortunately this requires fairly hefty dynamic scheduling in hardware,
and all this generality and flexibility is rarely needed[14], and this
makes dataflow machines useful only for a few very wild application.


Well, I hope you have followed me down to here. I have been trapped into
spilling the beans on what I think is a correct model of computational
activity and its implication on the structure of an architecture
(register based, either load/store or traditional, and stack based,
either single or multiple), and its implementation (scalar or
superscalar). Hope it has been useful to you as well, especially in
debunking certain myths about optimizers, registers, superscalars, ...

================

[1] But I received by e-mail a report that in some cases there are more
than 128!

[2] One that uses profiling, or even better one that believes 'register'
or uses heuristics -- I think Hank Dietz's group was into this sort of
things -- to estimate the dynamic frequency of use of values.

[3] Spare means mostly for computing expressions, global for aching
values across expressions, not even procedures. Interprocedural caching
is of little benefit except in particular cases.

[4] For example, I have received some interesting mail that shows that
in order to keep the pipeline and functional units of a MIPS R3000/R3010
CPU going full tilt, you only need six floating point registers.

[5] We should not be interested in how many registers we can make use
of, but in how many registers we can productive use of, and what this is
function of, and how much does this cost.

[6] But see David Wall's classic paper about global optimization to see
that in his extreme examples the knee of the curve is at 8 registers if
dynamic frequency is taken into account, versus three times as many if
only static frequency is used.

[7] Any value can interact with any other value, but instead nesting is
so prevalent in actual value dataflow patterns.

[8] Note that a pure stack architecture is load/store too: only push and
pop address memory. It does away with registers, because the operands
are always implicit, so all instructions except push and pop have only
the opcode (a "syllable", in Burroughs terminology, or a "word" in
Forth).

[9] one for addessing calculations, the other for nonfloat calculation;
a third CPU stack will be in use if there are float calculations.

[10] Well, we could remove the "if" and substitute it with an "as" here!

[11] But then if the application value data flow patterns show more than
four independently exploitable nests of computation, it is no longer a
general purpose application as commonly intended. Use a Cray, a DAP, a
Teradata, a Connection Machine or any other SIMD/MIMD thingie.

[12] I seem to remember that this is how modern 370 architectures are
implemented; the 370 interior decor is emulated by a more or less loose
collection of functional units connected by dataflow paths. IBM have
always seen the 370 (or other) interior decor as little more than a
declarative language, to be interpreted on the fly. In the S/38 the
interior decor is actually translated, more or less statically, to
different microcodes on different models.

[13] However for both it is probably better to go the multiple CPU,
rather than the multiple functional unit, route.  Note that,
intriguingly, for most workloads, four is also the knee of the curve for
macroparallelism. Maybe R. A. Wilson's Law of Fives should be a Law of
Fours :-).

[14] Because general purpose application dataflow patterns (a mantra by
now!)  have fairly predictable shapes, as above, and full dataflow
machines do not exploit this advance knowledge. Not that I think of it,
they could, though, probably, by special casing. Could be interesting.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk