Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!oliveb!mipos3!omepd!mcg
From: mcg@mipon2.intel.com
Newsgroups: comp.arch
Subject: Re: Register usage
Message-ID:
Date: 4 Jun 89 19:08:25 GMT
References: <259@mindlink.UUCP> <25382@ames.arc.nasa.gov> <25786@amdcad.AMD.COM>
Sender: news@omepd.UUCP
Reply-To: mcg@mipon2.UUCP (Steven McGeady)
Organization: Intel Corp., Hillsboro
Lines: 75

In article <25786@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article mcg@mipon2.UUCP (Steven McGeady) writes:
>| [Re:] huge directly-addressable register files ... in any reasonable
>| implementation, the register file must be multi-ported. A six-ported register
>| cell is about 5x the size of a single-ported register cell ...
>
>I don't see why this is a reason not to implement large register files.
>You need to apply transistors where they will do the most good.

This is a truism, a cliche. The point is that we are, and have always
been, limited in our scope by how many devices could be {effectively,
economically, manufacturably} put on a die. If one spends 2/3 of one's
silicon budget on a register file, then correspondingly less is available
for function units (e.g. floating-point), caches, and on-chip peripherals.
The question of whether a large register file will "do the most good" is
precisely the point here, and I have seen no evidence that it is.

>Note
>that processors that attempt to "more fully utilize micro-parallelism"
>also tend to want to have more general-purpose registers available to
>maintain full performance.

As above, I'd like to see your experimental evidence that suggests that
more than 64 general registers are useful for a preponderance of embedded
applications (which is, I believe, your target market), or for any
application other than multi-precision scientific code. We have seen
that, given current and projected compiler technology, an extremely large
*addressable* register set is substantially less useful than a large
on-chip cache.
With wide (128-bit) low-latency transfers from cache to registers
overlapped with other operations, a large register set is not really
needed. Large embedded applications we have examined show an average of
6 local scalar variables in registers, with an additional 3-5 temporaries
in use. Additional registers may be used to cache global variables, and
to temporally separate destination values that could otherwise share
registers, thus avoiding scoreboarding blocks. And since one can put 5x
the cache in the place of the registers that would be added, the cache
makes even more sense as a repository for values of import that are not
currently in registers.

>| This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
>| (actually 256) architectural registers. In the current implementation they
>| are (I believe) 3-ported, and even now occupy a large amount of the 29k
>| die space. I believe that they will run into serious problems if they
>| ever attempt to dispatch and execute multiple general instructions per cycle.
>
>Well, we see no problems in either our 2nd or 3rd generation parts...

Well then, do speak up. I can't imagine how you will effectively increase
the number of ports in the register file without creating an imbalance
between CPU speed and bus (because of poor register/cache balance). And
I can't imagine how you will lower your CPI without it. But then, perhaps
I'm just not imaginative enough.

>| The 960, on the other hand, exposes 32 general registers architecturally,
>| but because 16 of these are "local", and saved/restored on call/return
>| to/from architecturally-hidden resources, we can easily move from a cheap
>| (1-ported by 4 sets) register implementation, to a very fast one
>| (6-ported by 8+ sets) in high-performance implementations.
>
>So you *will* be looking at large register files (128+, 6-ported) for
>high performance.

No, not at all.
If you read Glenn Hinton's Compcon paper, you will realize that, while
in the first generation we used a ((16*4)+16) single-ported register
file, the subsequent generation includes a 32-register, 6-ported file,
and that registers spilled by call and restored on return are flushed to
an on-chip cache capable of storing 8 or more of these register sets.
The registers are flushed across a 128-bit wide dual bus, so call and
return take only 4 clocks each.

S. McGeady
Intel Corp.
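
A rough check of the arithmetic behind the figures quoted above, as a sketch only. It assumes 32-bit registers, one 128-bit transfer per clock, and a cache cell about the size of a single-ported register cell; none of these is stated in the post, and the function names are illustrative. Decoders, wiring, and tag storage are ignored.

```python
# Back-of-envelope check of the figures quoted in the post.
# Assumptions (NOT stated in the post): 32-bit registers, one 128-bit
# transfer per clock, and a cache cell the size of a single-ported
# register cell. All names below are hypothetical.

REG_BITS = 32          # assumed register width
BUS_BITS = 128         # bus width quoted in the post
LOCALS_PER_SET = 16    # "local" registers saved/restored on call/return
PORT_AREA_RATIO = 5    # six-ported cell ~5x a single-ported cell (from the post)


def spill_clocks(regs=LOCALS_PER_SET, reg_bits=REG_BITS, bus_bits=BUS_BITS):
    """Clocks to move one register set, at one bus-wide transfer per clock."""
    total_bits = regs * reg_bits
    return -(-total_bits // bus_bits)  # ceiling division


def area_units(six_ported_regs, cached_regs=0, ratio=PORT_AREA_RATIO):
    """Area in units of one single-ported register cell (storage cells only)."""
    return six_ported_regs * ratio + cached_regs


if __name__ == "__main__":
    print("first-generation registers:", 16 * 4 + 16)
    print("clocks per call or return:", spill_clocks())
    print("128 six-ported registers:", area_units(128), "units")
    print("32 six-ported + 8 cached sets:",
          area_units(32, 8 * LOCALS_PER_SET), "units")
```

Under these assumptions one 16-register set is 512 bits, which is four 128-bit transfers, consistent with the 4-clock call and return quoted above. The area comparison is only as good as the 5x cell ratio and the cache-cell assumption.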