Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!oliveb!mipos3!omepd!mcg
From: mcg@mipon2.intel.com
Newsgroups: comp.arch
Subject: Re: Register usage
Message-ID:
Date: 4 Jun 89 19:27:04 GMT
References: <259@mindlink.UUCP> <25382@ames.arc.nasa.gov> <26145@lll-winken.LLNL.GOV>
Sender: news@omepd.UUCP
Reply-To: mcg@mipon2.UUCP (Steven McGeady)
Organization: Intel Corp., Hillsboro
Lines: 46

In article <26145@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>In article mcg@mipon2.UUCP (Steven McGeady) writes:
>>The cut line is different on every architecture - 32 is sufficient on the
>>960, but I am not disagreeing with Wall's estimate of 64.  Certainly
>>in floating-point intensive scientific applications dominated by
>>double-precision arithmetic in loops, more registers are needed.  But
>>substantially more than 64 seems to limit architectural flexibility
>>quite severely.

>The cut line for scratch registers is directly proportional to the memory
>latency.  The longer your latency the more registers, and concurrently
>handled computation, you need to mask it.

This is entirely correct, but doesn't mention several distinctions.

First, there is a distinction between *addressable* registers (of which
the 960 has 32) and *available* registers (of which the 960 architecture
has an undefined number, the K* implementations have 80, and subsequent
implementations have more than 144).  The distinction is important
because one must decide what the locality of register use is: do you
wish to/need to use the preponderance of your registers in a single
routine, or do you use them across many routines?  In the former case,
a larger addressable register file helps; in the latter case it does
not, and may adversely impact other aspects of the architecture.
Scientific code often falls into the former category.  Most control
code (at least when well-structured and/or written in an HLL) falls
into the latter.
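
As an aside on the quoted proportionality claim, here is a rough
back-of-envelope sketch (Python, for brevity).  It assumes a simple
model in which the number of values in flight is the load issue rate
times the memory latency.  The function name, its parameters, and the
sample latencies are all hypothetical illustration values, not
measurements of the 960 or of any other part.

```python
import math


def registers_to_cover_latency(latency_cycles, loads_per_cycle, regs_per_load=1):
    """Estimate scratch registers needed to keep loads in flight.

    Assumed model (not from any datasheet): values in flight equal
    issue rate times latency, and each value occupies regs_per_load
    registers (e.g. 2 for a double held in a pair of 32-bit registers).
    """
    in_flight = latency_cycles * loads_per_cycle
    return math.ceil(in_flight * regs_per_load)


if __name__ == "__main__":
    # Hypothetical latencies in cycles; one load issued per cycle,
    # two registers per double-precision operand.
    for latency in (2, 4, 8, 16):
        print(latency, registers_to_cover_latency(latency, 1.0, 2))
```

Under this model, doubling the latency doubles the register estimate,
which is all the proportionality claim says.  It says nothing about
whether those registers need to be addressable from a single routine.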

Second, memory latency is not simply a function of external bus speed
and memory wait-states.  It is affected by the size, speed, and
behaviour of on-chip caches.  The balance between caching and registers
is tricky.  Caches can never be big enough, so there is a temptation to
replace them with registers and assume the caches will be implemented
off-chip.  This is a fine choice for some applications, but not for
those with an emphasis on low-cost computing.  In cost-sensitive
applications, off-chip memory is strictly DRAM, and the customer tunes
the cost (aka speed) of the DRAM to her performance needs.  You need
90% of the rated performance?  Use 1 wait-state; only 70%?  Use 3
wait-states.  Contrast this with the equivalent message of other
architectures: Don't want SRAM cache?  Then you get 50% performance;
Don't want two distinct memory systems?  Then use this special memory
that's more expensive and slower than what you might want to use.

Of course, most comp.arch readers probably don't actually *build*
systems out of these chips, so these questions are at best boring, and
probably irrelevant here.

S. McGeady
Intel Corp.