Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!oliveb!mipos3!omepd!mcg
From: mcg@mipon2.intel.com
Newsgroups: comp.arch
Subject: Re: Register usage
Message-ID:
Date: 30 May 89 00:35:34 GMT
References: <259@mindlink.UUCP> <25382@ames.arc.nasa.gov>
Sender: news@omepd.UUCP
Reply-To: mcg@mipon2.UUCP (Steven McGeady)
Organization: Intel Corp., Hillsboro
Lines: 43

In article <25382@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>In article <259@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
>
>>say).  In the extreme case of the computer someone mentioned that had 256
>>registers,
>
>I agree that 256 registers were probably too many, because the compiler had
>trouble using more than about 60 typically.
>
>I am not sure what you mean by "slow programs down".  Certainly, RISC machines
>have bigger code than compact code CISC's like VAX and NS32000.  But the RISC
>machines have generally been significantly faster when implemented in the
>same technology than the corresponding CISC machines.

One reason not to implement huge directly-addressable register files,
which no one has yet pointed out, is that in any reasonable
implementation the register file must be multi-ported.  A six-ported
register cell is about 5x the size of a single-ported register cell or
a cache RAM cell.  To utilize micro-parallelism in an architecture more
fully, more sources must be read from, and more results written to, the
register file simultaneously; hence the additional ports.

This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
(actually 256) architectural registers.  In the current implementation
they are (I believe) 3-ported, and even now occupy a large amount of the
29k die space.  I believe that they will run into serious problems if
they ever attempt to dispatch and execute multiple general instructions
per cycle.

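
The port arithmetic above can be put into rough numbers. The sketch below is illustrative only: the post gives just two data points (a 1-ported cell is 1x, a 6-ported cell is about 5x), and the straight line drawn between them is an assumption, as are the function names cell_area and file_area. Real cell area need not scale linearly with port count.

```python
# Rough register-file area model, in units of one single-ported cell.
# ASSUMPTION: linear interpolation between the two points given in the
# post (1 port -> 1x, 6 ports -> about 5x). Not a measured layout rule.

def cell_area(ports: int) -> float:
    """Relative area of one register cell with the given number of ports."""
    if ports < 1:
        raise ValueError("a register cell needs at least one port")
    return 1.0 + (5.0 - 1.0) * (ports - 1) / (6 - 1)


def file_area(registers: int, ports: int) -> float:
    """Relative area of a register file: cell count times per-cell area."""
    return registers * cell_area(ports)


if __name__ == "__main__":
    # 192 registers at 3 ports is the 29k figure quoted in the post;
    # the other rows are hypothetical comparisons.
    for registers, ports in [(192, 3), (192, 6), (32, 6)]:
        print(registers, "regs x", ports, "ports:", file_area(registers, ports))
```

Under these assumptions, taking a 192-register file from 3 ports to 6 nearly doubles its area, while a 6-ported file of 32 registers stays well under the area of the 3-ported 192-register file.
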
The 960, on the other hand, exposes 32 general registers architecturally,
but because 16 of these are "local", and saved/restored on call/return
to/from architecturally-hidden resources, we can easily move from a cheap
(1-ported by 4 sets) register implementation to a very fast one
(6-ported by 8+ sets) in high-performance implementations.

The cut line is different on every architecture - 32 is sufficient on the
960, but I am not disagreeing with Wall's estimate of 64.  Certainly in
floating-point intensive scientific applications dominated by
double-precision arithmetic in loops, more registers are needed.  But
substantially more than 64 seems to limit architectural flexibility quite
severely.

S. McGeady
Intel Corp.
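
As a postscript to the point about "local" registers, here is a toy model of why the number of on-chip sets can vary between implementations without changing what the program sees. Everything in it is an assumption made for illustration: the class name LocalRegisterSets, the spill-oldest policy, and the fill-on-demand policy are hypothetical, and the post does not describe the actual 960 mechanism.

```python
# Toy model: each call gets a fresh set of 16 local registers. Sets live
# on chip while there is room; otherwise the oldest is spilled to memory.
# ASSUMPTION: spill-oldest / fill-on-demand policy, chosen for simplicity.

class LocalRegisterSets:
    REGS_PER_SET = 16  # local registers per call frame, per the post

    def __init__(self, num_sets: int):
        if num_sets < 1:
            raise ValueError("need at least one on-chip register set")
        self.num_sets = num_sets
        self.on_chip = []    # resident frames, oldest first
        self.in_memory = []  # spilled frames, oldest first
        self.spills = 0
        self.fills = 0

    def call(self) -> None:
        """Allocate a new local set, spilling the oldest if the chip is full."""
        if len(self.on_chip) == self.num_sets:
            self.in_memory.append(self.on_chip.pop(0))
            self.spills += 1
        self.on_chip.append([0] * self.REGS_PER_SET)

    def ret(self) -> None:
        """Drop the current set; refill the caller's set if it was spilled."""
        if not self.on_chip:
            raise RuntimeError("return without a matching call")
        self.on_chip.pop()
        if not self.on_chip and self.in_memory:
            self.on_chip.append(self.in_memory.pop())
            self.fills += 1


def run_call_chain(depth: int, num_sets: int) -> tuple:
    """Make `depth` nested calls, then return from all of them."""
    sets = LocalRegisterSets(num_sets)
    for _ in range(depth):
        sets.call()
    for _ in range(depth):
        sets.ret()
    return sets.spills, sets.fills
```

In this model a call chain six deep costs two spills and two fills with 4 sets and none with 8, yet the program addresses the same 16 local registers in both cases. Only the hidden resource behind them changes.
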