Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!sun-barr!sun!chiba!khb
From: khb@chiba.Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS)
Newsgroups: comp.arch
Subject: Re: Register usage
Message-ID: <107834@sun.Eng.Sun.COM>
Date: 2 Jun 89 18:59:26 GMT
References: <978@aber-cs.UUCP>
Sender: news@sun.Eng.Sun.COM
Reply-To: khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS)
Organization: Sun Microsystems, Mountain View
Lines: 93

In article <978@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

>...
>It must be repeated for the Nth time that this is only true if spill
>minimization is of paramount importance; if you look at speed, then most
>spills avoided by global optimizers with large register sets don't make much
>of a difference.

And for the N+1th time, I guess, it must be repeated that there is a
class of machines for which minimizing spills is quite interesting (and
getting more so all the time). Consider the paper by Hsu, Dehnert, and
Bratt in ASPLOS-III on the Cydra 5, as an example.

>    
>    Upshot: modern compilers can employ as many registers as you can design
>	in.
>
>But pointlessly... And even old compilers can just register everything in

One can, but old compilers tend to do a very bad job of it. 

>sight; if there are many registers, then using an optimizer is not very
>important.  The hard work, as we have just discussed, is to cache *only* the
>variables that matter, for *only* the section of code where they matter (and

Which with software pipelining, trace scheduling and similar techniques
can be quite a long time indeed. On the Cydra 5 a memory write was
assumed to take 26 cycles, and two memory references could be initiated
EVERY clock, as well as two integer operations and two FP operations.
Since the FP operations required more than one cycle, the instruction
scheduling was quite interesting. With respect to register "live time",
a given register might be required several loop iterations into the future.
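
A back-of-the-envelope sketch of that last point, in Python. It assumes
the usual software-pipelining rule of thumb that a value which stays live
for L cycles, in a loop that starts a new iteration every II cycles,
needs about ceil(L / II) register copies. The 26-cycle figure is the one
quoted above; the initiation intervals and the name live_copies are made
up for illustration and are not taken from the Cydra 5 compiler.

```python
import math

def live_copies(lifetime_cycles, initiation_interval):
    """Register copies needed for one loop value (hypothetical helper).

    If a value stays live for lifetime_cycles and a new iteration starts
    every initiation_interval cycles, then ceil(lifetime / II) iterations
    hold their own instance of that value at the same time.
    """
    if lifetime_cycles < 0 or initiation_interval <= 0:
        raise ValueError("lifetime must be >= 0 and II must be > 0")
    return math.ceil(lifetime_cycles / initiation_interval)

MEMORY_LATENCY = 26  # cycles, the figure quoted in the post

# Initiation intervals below are illustrative assumptions only.
for ii in (1, 2, 4, 13, 26):
    copies = live_copies(MEMORY_LATENCY, ii)
    print(f"II={ii:2d} cycles -> {copies:2d} live copies of one value")
```

The tighter the loop is scheduled (small II), the more iterations overlap,
and the more registers a single source-level variable ends up occupying.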

>this can be done by the programmer using "register" in C, or by the compiler

Which is often ignored by the compiler...simply because programmers
cannot reasonably guess how the compiler (try it on your Multiflow
Trace/28 for a bunch of codes and show us what is produced!) will
unroll, split and otherwise contort your code.


>when fed with either "representative" profile data, or with calculations or
>estimations of where hot spots lie).  This typically requires many less
>registers than minimizing spills regardless of whether they are expensive
>ones or not.

Doing a "spill" (i.e. running out of interconnect) on the Cydra 5
meant your loop ran 10x slower. This is not acceptable to most programmers.


>    Naive rationale for infinite (as long as they are free) registers:
>				  ^^^^^^^^^^^^^^^^^^^^^^^^
>
>Unfortunately they are not free; more registers make the system stiffer, in

True. Which is why folks build windows, small (32) register files,
file pointers (AMD, Gould) and other stuff.

>that they do raise the cost of multithreading, which is where os technology
>is finally heading (Mach, Os/2, etc...), and they do have costs in real
>estate and even, possibly, cycle time lengthening (Cray's law). You only
>need a handful of registers to capture most of the benefit of expression
>optimization, and another to capture most of the benefits of intra statement
>optimization (whether you do it via "register" in C or leave it to the
>compiler).

Your assertion about multithreading is quite true; it is here to
stay. It is far from clear that transputer-type designs (tiny machine,
fast communication) will win out over somewhat "chunkier" designs.

But the assertion about a handful of registers being sufficient on
high performance machines is simply not borne out. All of Seymour's
machines have a bunch (don't forget those vector registers), and this
is NECESSARY for those long pipes (superpipelining?) ... and it is
just as true for software pipelining.
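
To put a rough number on "a bunch": the sketch below, in Python, simply
multiplies issue rate by latency to estimate how many results are in
flight when the pipes are kept full. The two-per-clock rates and the
26-cycle memory figure are the Cydra 5 numbers quoted earlier, treating
that figure as the latency of a memory reference that returns a value.
The FP latency of 4 cycles is an assumption for illustration only (the
text above says just "more than one cycle"), and values_in_flight is a
made-up name.

```python
def values_in_flight(issue_per_cycle, latency_cycles):
    """Results outstanding when a pipelined unit is kept busy.

    Each cycle issue_per_cycle operations start, and each takes
    latency_cycles to deliver, so issue * latency results are pending
    at any moment, and each wants somewhere to land.
    """
    return issue_per_cycle * latency_cycles

MEM_ISSUE, MEM_LATENCY = 2, 26  # per clock, cycles: figures from the post
FP_ISSUE, FP_LATENCY = 2, 4     # FP latency is a hypothetical value

memory_pending = values_in_flight(MEM_ISSUE, MEM_LATENCY)
fp_pending = values_in_flight(FP_ISSUE, FP_LATENCY)
total_pending = memory_pending + fp_pending

print("memory results pending:", memory_pending)
print("FP results pending:    ", fp_pending)
print("total wanting a home:  ", total_pending)
```

Even with the guessed FP latency set aside, the memory side alone comes
to far more than a handful.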


>Large register banks are only justified for special purpose machines
>(vector, VLIW, superscalar) where the only thing that matters is raw speed
>in processing batched numeric codes where there is an inherent high degree
>of parallelism in the algorithms employed.

Multiflow claims that they eat "general" code just fine. As Mashey has
pointed out, there are several superscalar projects running around ...
and business codes, database codes, and windowing systems benefit from
that kind of parallelism just like numeric codes (although writing the
code in C makes it much harder to extract the parallelism).


Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)