Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!decwrl!nestvx.dec.com!neideck
From: neideck@nestvx.dec.com (Burkhard Neidecker-Lutz)
Newsgroups: comp.arch
Subject: Re: Register usage
Message-ID: <8905110956.AA12655@decwrl.dec.com>
Date: 11 May 89 09:56:24 GMT
Organization: Digital Equipment Corporation
Lines: 98

Summary: If your compiler is not antiquated (compared to this one, even
         gcc looks bad), you can use gazillions of registers. The speedup
         is on the order of 30%, so don't hold your breath.

David W. Wall of Digital's Western Research Lab did exactly the kind of
analysis you all are looking for. The following is from David's paper
"Global Register Allocation at Link Time" (SIGPLAN Notices 21,7,
pp. 264-275).

The benchmarks used were the Livermore loops, Whetstone, LINPACK, the
Stanford benchmark suite and two applications, a logic simulator RSIM
and a timing verifier.

The compiler used was for the not-so-widely-known DECWRL Titan RISC
machine, an ECL RISC with 64 non-windowed registers and a single cycle
load. The compiler does global register allocation (yes, global
variables in the sense of C) at link time and is a common backend for
C, Fortran and Modula-2.

The register counts below do not include the expression evaluation and
address generation registers (these seem to be the 4-7 people have been
talking about so far), only those the optimizer uses to hold variables.
It partitions all scalar variables into non-conflicting groups and tries
to allocate those groups to registers.

From the introduction:

"When we use our method for 52 registers, our benchmarks speed up by
10 to 25 %. Even with only 8 registers, the speedup can be nearly that
large, if we use previously collected profile information to guide the
allocation. We cannot do much better, because programs whose variables
all fit in registers rarely speed up by more than 30 %.
Moreover, profiling shows us that we usually remove 60-90% of the loads
and stores of scalar variables that the program performs during its
execution."

The benchmarks' characteristics:

            Variables   Groups   coloring   coloring + profile
-------------------------------------------------------------------
Livermore      166        165      18 %           19 %
Whetstone      254        181      10 %           10 %
Linpack        214        119      13 %           13 %
Stanford       402        211      27 %           28 %
Simulator      811        262      15 %           16 %
Verifier      1395        693      15 %           19 %

Where the columns stand for:

Variables:          Candidate scalar variables for global register
                    allocation.

Groups:             Number of groups the above variables fall into,
                    each of which needs its own register. If the
                    machine had that many registers, every variable
                    could be assigned to a register.

coloring:           Speedup obtained by global register allocation
                    versus register allocation comparable to what,
                    say, gcc -O does.

coloring + profile: Speedup obtained by guiding the allocator with
                    actual profiles of execution.

David measured the number of memory references that could be eliminated
in the benchmarks if all scalars could be held in registers and then
plotted the percentage his scheme actually removed:

            coloring   coloring + profile
-----------------------------------------
Livermore     81 %           94 %
Whetstone     75 %           88 %
Linpack       95 %           99 %
Stanford      90 %           98 %
Simulator     83 %           95 %
Verifier      61 %           83 %

This shows that his scheme is very efficient at removing these memory
references. Please note that, given his enormous "hit rate" and the
not-so-impressive speedups he got, scalar memory references cannot make
up a very large share of all memory references compared with accesses
to bigger data structures.

Now the interesting table. What happens if you use fewer registers?
The following table shows the speed improvements with 52, 32 and 8
registers. All of these figures are the relative improvement of the
programs under global register allocation guided by profile information,
compared with "naive register allocation".
              52       32        8
-----------------------------------
Livermore    19 %     18 %     12 %
Whetstone    10 %     10 %      5 %
Linpack      13 %     13 %     10 %
Stanford     28 %     27 %     20 %
Simulator    16 %     15 %      8 %
Verifier     19 %     16 %      7 %

There is another very interesting paper by David comparing register
window schemes of varying organization with this global allocation
approach. It seems to suggest that a slightly bigger global register
file beats register windows if you are willing to use these extremely
advanced compilation techniques. The paper appeared in Proc. of the
SIGPLAN 1988 Conference on Programming Language Design and
Implementation, June 1988. The paper's title is "Register Windows vs.
Register Allocation". It's way too long to reproduce here and the
graphics in there are much nicer than anything I can type here.

		Burkhard Neidecker-Lutz, Digital CEC Karlsruhe

Disclaimer: I don't speak for Digital, etc. ...
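
As a small addendum to the 52/32/8 register table above: one way to
read it is to ask how much of the 52-register speedup each benchmark
keeps when the register count shrinks. The sketch below only redoes
that arithmetic on the figures quoted in this post. The names SPEEDUP,
retained and retention_table are made up for illustration and do not
come from Wall's paper.

```python
# Speedups (percent) copied from the table in this post: profile-guided
# global allocation relative to naive allocation, for 52, 32 and 8 registers.
SPEEDUP = {
    "Livermore": (19, 18, 12),
    "Whetstone": (10, 10, 5),
    "Linpack":   (13, 13, 10),
    "Stanford":  (28, 27, 20),
    "Simulator": (16, 15, 8),
    "Verifier":  (19, 16, 7),
}


def retained(full, reduced):
    """Percentage of the 52-register speedup still obtained with fewer registers."""
    return round(100 * reduced / full)


def retention_table(speedups=SPEEDUP):
    """Map benchmark name -> (percent retained at 32 regs, percent retained at 8 regs)."""
    return {
        name: (retained(s52, s32), retained(s52, s8))
        for name, (s52, s32, s8) in speedups.items()
    }


if __name__ == "__main__":
    for name, (r32, r8) in retention_table().items():
        print(f"{name:<10} 32 regs: {r32:3d} %    8 regs: {r8:3d} %")
```

By this measure Whetstone, for instance, keeps all of its 52-register
speedup at 32 registers and half of it at 8. Note that this is a ratio
of the rounded percentages in the table, so small differences should
not be over-interpreted.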