Path: utzoo!attcan!uunet!cs.utexas.edu!rice!titan.rice.edu!preston
From: preston@titan.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: Compilers taking advantage of architectural enhancements
Message-ID: <1990Oct12.151620.11923@rice.edu>
Date: 12 Oct 90 15:16:20 GMT
References: <3300194@m.cs.uiuc.edu> <AGLEW.90Oct11222801@treflan.crhc.uiuc.edu> <1990Oct12.042251.18884@cs.cmu.edu>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
Lines: 55

In article <1990Oct12.042251.18884@cs.cmu.edu> spot@TR4.GP.CS.CMU.EDU (Scott Draves) writes:

>|> >>Register file - large (around 128 registers, or more)
>|> >>    Most compilers do not get enough benefit from these to justify
>|> >>    the extra hardware, or the slowed down register access.
>|> >
>|> >In the proceedings of Sigplan 90, there's a paper about how to chew
>|> >lots of registers.
>|> >
>|> >	Improving Register Allocation for Subscripted Variables
>|> >	Callahan, Carr, Kennedy

>shouldn't loop unrolling burn lots of registers also?  when unrolling,
>which ceiling will you hit first, the number of registers, or the size
>of the I-cache?

I shouldn't have been so cavalier when I said "chew lots of registers".
I meant "use profitably".

Simple unrolling of the inner loop offers little advantage to
a fabulous optimizing compiler.  If the optimizer can't do software
pipelining, then unrolling (if performed correctly) can provide larger
chunks of code to schedule across.  However, if the loop contains
recurrences, then unrolling can't help much.
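
A minimal sketch of the recurrence case, in C.  The function names and
the particular recurrence are illustrative assumptions, not taken from
any paper or compiler.  Unrolling by 2 gives the scheduler a bigger
block, but the second statement still has to wait for the first.

```c
#include <stddef.h>

/* First-order recurrence: each a[i] needs the a[i-1] just computed. */
void recur_rolled(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] * b[i] + c[i];
}

/* The same loop unrolled by 2 (hypothetical hand-unrolled version). */
void recur_unrolled2(double *a, const double *b, const double *c, size_t n)
{
    size_t i = 1;

    for (; i + 1 < n; i += 2) {
        a[i]     = a[i - 1] * b[i]     + c[i];
        /* This statement still depends on the a[i] computed just above,
           so the dependence chain is as long as in the rolled loop. */
        a[i + 1] = a[i]     * b[i + 1] + c[i + 1];
    }

    /* Remainder iteration when the trip count is odd. */
    for (; i < n; i++)
        a[i] = a[i - 1] * b[i] + c[i];
}
```

Both versions compute the same values; the unrolled one only saves
a branch per pair of iterations.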

oh yeah.  it can save some conditional branches.  whoopee

The paper I mentioned above is more aggressive (and more profitable).
They advocate using dependence analysis to detect reuse of array elements.
Where there's consistent reuse, they can replace memory references
with register references.  They can also detect opportunities to
unroll *outer* loops and jam the multiple inner loop bodies.
This creates more opportunities for holding reused values in registers
and also helps solve the problem of scheduling loops with recurrences.
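
Here is a rough sketch of the two ideas on a toy loop nest, written
as a source-to-source rewrite in C.  The nest, the function names,
and the unroll factor of 2 are my own illustrative assumptions; the
paper's examples and its analysis are more general.  It also assumes
a and b do not overlap.

```c
#include <stddef.h>

/* Original nest: a[i] is reused on every inner iteration, and b[j]
   is reused across outer iterations, but both go through memory. */
void nest_plain(double *a, const double *b, size_t n, size_t m)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            a[i] = a[i] + b[j];
}

/* Outer loop unrolled by 2, the two inner bodies jammed together,
   and the reused array elements held in scalars.
   Assumes a and b do not alias. */
void nest_unroll_jam(double *a, const double *b, size_t n, size_t m)
{
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        double a0 = a[i];        /* scalar replacement of a[i]   */
        double a1 = a[i + 1];    /* scalar replacement of a[i+1] */
        for (size_t j = 0; j < m; j++) {
            double bj = b[j];    /* one load now feeds two adds  */
            a0 += bj;
            a1 += bj;            /* independent of the a0 chain  */
        }
        a[i] = a0;
        a[i + 1] = a1;
    }

    /* Leftover outer iteration when n is odd. */
    for (; i < n; i++) {
        double a0 = a[i];
        for (size_t j = 0; j < m; j++)
            a0 += b[j];
        a[i] = a0;
    }
}
```

In the rewritten inner loop there is one load per two adds instead of
a load (plus the a[i] traffic) per add, and the two accumulations are
independent, so the scheduler can overlap them even though each one
is a recurrence.  A larger unroll factor needs more registers.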

On machines like the MIPS, SPARC, and i860, they can get factors
of 3 improvement using source-source transformations and the stock
compiler.

These same ideas provide the basis for managing the D-cache.

Scott also asked about blowing out the I-cache.

It's possible; massive unroll-and-jamming can consume lots of code
space.  However, the usual limit is the number of registers
or the speed of the FPU.  All these transformations are intended
to avoid computations being memory-bound.  Once you're compute-bound
(floating-point unit is 100% busy), there's nothing else you can do.
This is why I question the need for hundreds of registers.

READ this paper.  It's not optional.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu