Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: Register Count
Message-ID: <PCG.91Jan13174042@odin.cs.aber.ac.uk>
Date: 13 Jan 91 17:40:42 GMT
References: <PCG.91Jan10162301@odin.cs.aber.ac.uk> <11566@pt.cs.cmu.edu>
Sender: cho@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 181
Nntp-Posting-Host: odin
In-reply-to: lindsay@gandalf.cs.cmu.edu's message of 12 Jan 91 20:59:26 GMT

On 12 Jan 91 20:59:26 GMT, lindsay@gandalf.cs.cmu.edu (Donald Lindsay) said:

lindsay> In article <PCG.91Jan10162301@odin.cs.aber.ac.uk> 
lindsay> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pcg> Loop unrolling only wins if you have some degree of multiple
pcg> functional units, otherwise you only buy reduced overhead, which is
pcg> not much, especially if the implementation has some degree of
pcg> pipelining.

lindsay> Unrolling (to reduce overhead) is the basic trick of hand-coded
lindsay> memcpy routines, on such "multiple functional unit" classics as
lindsay> the 8080 and 68010. Pipelining makes unrolling more useful, not
lindsay> less.

My fault -- I was not discussing 8080s and 68010s, but contemporary
machines. If there is *no* suitable pipelining, like for those,
unrolling buys you reduced overhead, as I wrote. Not a lot, and you
don't need a lot of unrolling to reduce overhead suitably, and since you
have no pipelining you can reuse the registers in each stage of the
unrolling. I remember having done some experiments with copying core on
a PDP-11/70; unrolling beyond three or four was mostly pointless; with
loops with a large body the advantage was even smaller.

If you have *some limited* degree of pipelining, as in contemporary
implementations, such as the classic three-four stage pipeline that
overlaps some computation with some control, and especially if this
pipeline is exposed with things like delayed branches, then unrolling
buys you nothing at all in time, and loses code space.

Maybe you could get up to date on the "benefits" of loop unrolling for
contemporary, shallow pipelined, implementations if you got my
CoreCopy() function from a comp.arch archive, and tried it on several
contemporary machines with different values for its several options.
Just to inspire your curiosity, I have found that for example on an
implementation with a shallow pipeline like the 68020 unrolling buys you
nothing at all. This is not untypical.

I have been sent an example (thanks!) that shows that on a MIPS R30x0,
which has a shallow implementation, one can keep the floating point
pipeline busy with the use of 6-7 registers and *no* unrolling. This is
not untypical either.

pcg> It is an old controversy, but let me repeat here: 90% of what
pcg> passes for "optimizing" compilers are compilers that optimize for
pcg> *space*, not time, ...

lindsay> The optimizing compilers that I worked on, were trying for
lindsay> speed.  We tuned the things by endlessly recompiling and
lindsay> running our (ever-expanding) code-quality suite. We didn't
lindsay> bother to measure how small the generated code was: we just
lindsay> timed it.

This seems (to me at least) a funny way of proceeding, and deomstrates a
trial-and-error mentality that may be there because of lack of insight.

Well, maybe that's the case. If you were aiming at one goal with
technologies suited to achieving another goal, too bad for you.

It can happen, even to the best; a lot of serious and important people
that publish papers on "optimizing" compilers expound the funny view
that things like graph coloring optimize speed and not space. I don't
know who's the joke on.

The joke is that graph coloring minimizes the *number* of spills, not
their *cost* in time. Most spills have negligible cost in time, because
they are not in critical paths, and are not executed frequently.

If an "optimizer" does not attempt to minimize the *dynamic* cost of
spills, by estimating the frequency of reuse of values either by
feedback profiling or by heuristics, it will only achieve optimal timing
by pure chance, or by having so many registers to play with that the
cost of spills is zero because there aren't any (virtually all the so
called "optimizers" for RISC machines).

I had hoped that more people understood what David Wall and Hank Dietz
can say on this subject (not to mention more esoteric technologies).

pcg> Based on my armchair evidence I believe that most general purpose
pcg> applications data flows can be modelled best with four multiple
pcg> stacks (quite shallow, by the way, say four deep), because such
pcg> dataflows have the shape of a tree with a relatively low branching
pcg> factor, and relatively small number of levels in each independent
pcg> branch.

lindsay> Further, you are referring to floating point calculation.

No, to nonfloat ones, explicitly. Yet I think that for float ones things
are similar, for many, but maybe not most, applications. Several
important numerical applications that (unfortunately) use floating point
have FIFO dataflow reference patterns, and of course a SIMD/MIMD machine
is far better suited to them than one based on multiple stacks, which is
designed for (mythical :->) general purpose applications.

I think that it is in order to point to the classic "The MU5 computer
system", by Ibbett & Morris, MacMillan, for a clear description of how
different must architecture be for different types of expected dataflow
reference patterns. The MU5 was designed for FIFO, numerical codes,
while it ended up running general purpose applications. Baaad vibes,
man.

lindsay> It is just as easy to believe that the modal stack depth would
lindsay> be 1 (one), hence, the multiple stacks would be that many
lindsay> (expensive) registers.

Hey, you are stealing one of my lines here! That you need very few
registers because most dynamically important expressions are terribly
simple is *my* mantra.

I introduced the idea of multiple stacks as a concession to the
important, even if not the modal, case where the dataflow patterns do
have some (limited) complexity, like multiple non nested nests of
computation. Multiple stacks is like flat register file, except that it
takes advantage of the nested nature of each of the strands to allows
for separate stacks. Maybe I should repeat the argument more clearly
(hopefully). Suppose you have determined that you need *up to* 16 spare
registers to cache optimally your dataflow reference patterns, that
exhibit typically *up to* 4 independent strands of computation, each of
them involving maybe *up to* 4 levels of spill.

How should you organize the 16 registers? If you choose a flat register
file you need 4+4(+4) bits to encode the operand numbers and you need,
for an N<=4-way superscalar implementation, an N+k multiported file. If
you use four register stacks with four levels each, you need 2 bits per
operand number and each separate stack can be single ported.

Naturally the flat register file can cope much better with data
reference patterns that are not *up to* four way independent and *up to*
four way deep each, but I surmise that these are relatively rare, for
general purpose applications.

Naturally if you think that you can cache the locality of a code with
less than 16 registers, please use less than those, or if you think that
the data reference patterns exhibit no exploitable nesting, then use a
flat register file with its generality and associated costs.

More research needs to be done on this subject; I do have anectodal
evidence (and have read some papers that seem to imply) that 4x4 is a
good trade off between the number of registers and the profile of most
codes and their exploitable parallelism. I could be persuaded easily
that *most* codes do not need those many, because that's my own mentra,
but then the same argument would apply to the size of a flat register
file.

It is sad that people like you spend lots of time in trial-and-error
attempts to get an "optimizer" going using possibly irrelevant
technology, instead of investigating important issues like the shape of
dataflow reference patterns for various classes of applications. Or
maybe too many products/careers have been built on large register banks
and "optimizers" for this.

My argument, more than being centered on the number four (which I still
believe to be a reasonable estimate), was centered on the idea that a
flat register file does not reflect the expected reference patterns of
general purpose codes as the same number of registers arranged as
multiple stacks.

For certain types of codes, neither arrangements may be best -- vector
and vector oriented superscalars (e.g. the ZS-1) have register *queues*
instead of register stacks or flat files, as the expected dataflow
patterns are FIFO, with no value reuse, but with deep pipelining.

lindsay> But, on the whole, stack architectures are dead,

Just as quantitative computer architecture is, if serious people can
delude themselves that classic graph coloring optimizes run time and not
code space, and there is a scant research on studying the actual
reference patterns of classes of code, instead of building
products/careers on questionable _assumptions_, that seem to work only
because they are in fact never tested.

lindsay> and dealing with a proposed one doesn't seem like a good use of
lindsay> my time.

Yes, probably reading the relevant literature is a better use of your
time. Reread David Wall and company a million times... Or maybe the
David Wall in my world is a pop singer in yours :-).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk