Path: utzoo!attcan!uunet!cs.utexas.edu!rice!titan.rice.edu!preston
From: preston@titan.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: Compilers taking advantage of architectural enhancements
Message-ID: <1990Oct12.151620.11923@rice.edu>
Date: 12 Oct 90 15:16:20 GMT
References: <3300194@m.cs.uiuc.edu> <1990Oct12.042251.18884@cs.cmu.edu>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
Lines: 55

In article <1990Oct12.042251.18884@cs.cmu.edu> spot@TR4.GP.CS.CMU.EDU (Scott Draves) writes:

>|> >>Register file - large (around 128 registers, or more)
>|> >>  Most compilers do not get enough benefit from these to justify
>|> >>  the extra hardware, or the slowed down register access.
>|> >
>|> >In the proceedings of Sigplan 90, there's a paper about how to chew
>|> >lots of registers.
>|> >
>|> >    Improving Register Allocation for Subscripted Variables
>|> >    Callahan, Carr, Kennedy

>shouldn't loop unrolling burn lots of registers also?  when unrolling,
>which ceiling will you hit first, the number of registers, or the size
>of the I-cache?

I shouldn't have been so cavalier when I said "chew lots of registers".
I meant "use profitably".

Simple unrolling of the inner loop offers little advantage when you
already have a fabulous optimizing compiler. If the optimizer can't do
software pipelining, then unrolling (if performed correctly) can
provide larger chunks of code to schedule across. However, if the loop
contains recurrences, then unrolling can't help much.

Oh yeah, it can save some conditional branches. Whoopee.

The paper I mentioned above is more aggressive (and more profitable).
They advocate using dependence analysis to detect reuse of array
elements. Where there's consistent reuse, they can replace memory
references with register references. They can also detect
opportunities to unroll *outer* loops and jam the multiple inner loop
bodies.
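
To make the two ideas concrete, here is a rough sketch in C. It is not taken from the paper; the matrix-vector kernel, the function names, and the unroll factor of 2 are all my own assumptions, chosen only because they show both the scalar replacement of a reused array element and an outer loop being unrolled and jammed.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline (hypothetical kernel): y[i] += sum over j of a[i][j] * x[j].
 * As written, each inner iteration loads a[i][j], x[j] and y[i] and
 * stores y[i]: 3 loads and 1 store per multiply-add. */
void matvec_naive(size_t n, size_t m,
                  const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            y[i] += a[i * m + j] * x[j];
}

/* Outer loop unrolled by 2 and the two inner bodies jammed together.
 * y[i] and y[i+1] live in scalars (scalar replacement), and x[j] is
 * loaded once and used twice: 3 loads per 2 multiply-adds. */
void matvec_unroll_jam(size_t n, size_t m,
                       const double *a, const double *x, double *y)
{
    assert(n % 2 == 0);  /* assumption: even n, so no cleanup loop */
    for (size_t i = 0; i < n; i += 2) {
        double y0 = y[i];        /* reused value kept in a register */
        double y1 = y[i + 1];
        for (size_t j = 0; j < m; j++) {
            double xj = x[j];    /* shared by both jammed bodies */
            y0 += a[i * m + j] * xj;
            y1 += a[(i + 1) * m + j] * xj;
        }
        y[i] = y0;
        y[i + 1] = y1;
    }
}
```

The inner loop is a recurrence on the accumulator, so unrolling the inner loop alone leaves one serial chain of adds. After unroll-and-jam there are two independent chains (y0 and y1) in the same loop body.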
This creates more opportunities for holding reused values in registers
and also helps solve the problem of scheduling loops with recurrences.
On machines like the MIPS, SPARC, and i860, they can get factor-of-3
improvements using source-to-source transformations and the stock
compiler.

These same ideas provide the basis for managing the D-cache.

Scott also asked about blowing out the I-cache. It's possible; massive
unroll-and-jamming can consume lots of code space. However, the usual
limit is the number of registers or the speed of the FPU.

All these transformations are intended to keep computations from being
memory-bound. Once you're compute-bound (floating-point unit is 100%
busy), there's nothing else you can do. This is why I question the
need for hundreds of registers.

READ this paper. It's not optional.

-- 
Preston Briggs                          looking for the great leap forward
preston@titan.rice.edu