Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!amdcad!crackle!tim
From: tim@crackle.amd.com (Tim Olson)
Newsgroups: comp.arch
Subject: Register usage [was Re: 80486 vs. 68040 code size]
Message-ID: <25546@amdcad.AMD.COM>
Date: 8 May 89 02:08:05 GMT
References: <907@aber-cs.UUCP>
Sender: news@amdcad.AMD.COM
Reply-To: tim@amd.com (Tim Olson)
Distribution: eunet,world
Organization: Advanced Micro Devices, Inc. Sunnyvale CA
Lines: 80
Summary:
Expires:
Sender:
Followup-To:

In article <907@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
| In article <10979@polyslo.CalPoly.EDU> cquenel@polyslo.CalPoly.EDU writes:
|
|     Highly optimizing compilers have long been able to make very
|     good use of more than four registers.
|
| Well, in a long discussion a few months ago in comp.lang.c, nobody has been
| able (in a period of two months) to quote any FIGURES that support this and
| other urban legends (e.g. Elvis is alive and designed the Z80000 :->). There
| are plenty of figures though about the average extreme simplicity of actual
| statements/expressions/constructs in algorithmic level languages, which does
| not bode well for the usefulness of large register files.

OK, here are some figures to play with:

The Am29000 calling convention says that global temporary registers are
killed across a procedure call, while values in local registers remain
alive. When the Am29000 compilers perform lifetime analysis for register
assignment, they note which values must be alive across a procedure call
and assign those to local registers; the others are assigned to global
registers.

A static analysis of 495 functions shows that an average of 6.6 global
registers and an average of 7.0 local registers are used per function,
with the following register-use histogram: (ain't 'awk' wonderful?
;-)

495 total functions
ave globals: 6.5596
ave locals: 7.04242

--- globals ---         --- locals ---
  0:  1.82% (9)           0:  1.01% (5)
  1:  0.81% (4)           1:  5.66% (28)
  2:  7.27% (36)          2:  9.29% (46)
  3:  4.24% (21)          3: 13.94% (69)
  4: 10.51% (52)          4: 14.95% (74)
  5:  9.90% (49)          5:  8.08% (40)
  6: 17.58% (87)          6: 10.71% (53)
  7: 15.56% (77)          7:  7.47% (37)
  8:  8.08% (40)          8:  3.84% (19)
  9: 11.72% (58)          9:  3.84% (19)
 10:  4.85% (24)         10:  4.85% (24)
 11:  2.83% (14)         11:  3.03% (15)
 12:  1.41% (7)          12:  1.01% (5)
 13:  0.40% (2)          13:  1.01% (5)
 14:  1.21% (6)          14:  1.41% (7)
 15:  0.00% (0)          15:  0.81% (4)
 16:  0.81% (4)          16:  0.40% (2)
 17:  0.20% (1)          17:  0.81% (4)
 18:  0.00% (0)          18:  0.40% (2)
 19:  0.20% (1)          19:  1.41% (7)
 20:  0.40% (2)          20:  0.61% (3)
 21:  0.00% (0)          21:  1.62% (8)
 22:  0.00% (0)          22:  0.61% (3)
 23:  0.00% (0)          23:  0.20% (1)
 24:  0.20% (1)          24:  0.00% (0)
>24:  0.00% (0)         >24:  3.03% (15)

With a stack cache, the cost of saving and restoring the live registers
across procedure calls is constant up to the spill/fill boundary, while
with a simple global register file, the cost depends upon the number of
live registers (1 store + 1 load per live register). If we assume an
average of 7 live registers and 1.5% to 2.5% calls in the dynamic
instruction mix, live register save and restore would account for an
extra 21% ([7 stores + 7 loads]/66 instructions between calls) to 35%
([7 stores + 7 loads]/40 instructions between calls) of execution time,
assuming that loads and stores take 1 cycle.

However, leaf-procedure optimizations (which we also perform in the
current Am29000 compilers) can reduce the number of calls which must
save live registers by perhaps 30%, resulting in an overall execution
time increase of 14.7% to 24.5%.

Now, the Am29000 compilers perform CSE at a very low level (we assume
that registers are cheap, loads and stores expensive), so we probably
have a somewhat higher number of live registers across function calls
than other architectures. I'm sure someone at MIPS can supply us with
their equivalent numbers.
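
For anyone who wants to replay the overhead arithmetic above, here is a minimal Python sketch. It assumes the same simple model as the estimate: one store plus one load per live register at each call, and one cycle per instruction, load, or store. The function name and its parameters are hypothetical, made up for illustration. Note that the unrounded low-end figure comes out near 14.8%; the 14.7% quoted above is 0.7 times the already-rounded 21%.

```python
def save_restore_overhead(live_regs, calls_per_1000, saving_call_share=1.0):
    """Fraction of extra execution time spent saving/restoring live registers.

    Hypothetical helper. Assumes 1 store + 1 load per live register at each
    call, and 1 cycle for every instruction, load, and store.
    """
    # Instructions between calls, truncated as in the estimate above
    # (15 calls per 1000 -> 66, 25 calls per 1000 -> 40).
    instrs_between_calls = 1000 // calls_per_1000
    # One store and one load for each live register.
    extra_cycles = 2 * live_regs
    # saving_call_share is the fraction of calls that still have to save
    # registers (0.7 if leaf-procedure optimization removes about 30%).
    return saving_call_share * extra_cycles / instrs_between_calls


for calls_per_1000 in (15, 25):
    plain = save_restore_overhead(7, calls_per_1000)
    with_leaf_opt = save_restore_overhead(7, calls_per_1000, 0.7)
    print(f"{calls_per_1000 / 10}% calls: "
          f"{plain:.1%} extra, {with_leaf_opt:.1%} with leaf optimization")
```

Changing `live_regs` shows how directly the cost of a simple global register file scales with the number of values kept live across calls.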

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)