Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!oliveb!mipos3!omepd!mcg
From: mcg@mipon2.intel.com
Newsgroups: comp.arch
Subject: Re: Register usage
Message-ID: <m0fU37i-0001hNC@mipon2.intel.com>
Date: 4 Jun 89 19:27:04 GMT
References: <259@mindlink.UUCP> <25382@ames.arc.nasa.gov> <m0fRx4x-0001fDC@mipon2.intel.com> <26145@lll-winken.LLNL.GOV>
Sender: news@omepd.UUCP
Reply-To: mcg@mipon2.UUCP (Steven McGeady)
Organization: Intel Corp., Hillsboro
Lines: 46

In article <26145@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>In article <m0fRx4x-0001fDC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes:
>>The cut line is different on every architecture - 32 is sufficient on the
>>960, but I am not disagreeing with Wall's estimate of 64.  Certainly
>>in floating-point intensive scientific applications dominated by
>>double-precision arithmetic in loops, more registers are needed.  But
>>substantially more than 64 seems to limit architectural flexibility
>>quite severely.
>The cut line for scratch registers is directly proportional to the memory
>latency.  The longer your latency the more registers, and concurrently
>handled computation, you need to mask it.

This is entirely correct, but doesn't mention several distinctions.  First,
there is a distinction between *addressable* registers (of which the
960 has 32) and *available* registers (of which the 960 architecture has
an undefined number, the K* implementations have 80, and the subsequent
implementations have more than 144).  The distinction is important because
one must decide what the locality of register use is: do you wish to/need to
use the preponderance of your registers in a single routine, or do you
use them across many routines?  In the former case, a larger addressable
register file helps; in the latter case it does not, and may adversely
impact other aspects of the architecture.  Scientific code often falls
into the former category.  Most control code (at least when well-structured
and/or written in HLL) falls into the latter.

Second, memory latency is not simply a function of external bus speed
and memory wait-states.  It is affected by the size, speed, and behaviour
of on-chip caches.
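
As a rough illustration of this point, the latency a load actually sees
can be modeled with the usual average-access-time formula.  This is only
a sketch: the function name, the cycle counts, and the hit rates below
are made-up assumptions, not figures for the 960 or any other part.

```python
def effective_latency(hit_cycles, miss_penalty_cycles, hit_rate):
    """Average cycles per memory access with a single on-chip cache.

    All three parameters are hypothetical inputs chosen for illustration;
    none of them is a measured value for a real processor.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    # Every access pays the hit time; misses also pay the trip off-chip.
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    # Same external memory (assumed 10-cycle miss penalty), varying hit rate.
    for hit_rate in (0.0, 0.80, 0.95):
        print(hit_rate, effective_latency(1, 10, hit_rate))
```

With the external memory held fixed, the effective latency in this model
is set almost entirely by the cache hit rate, and by Brooks's argument
that is the latency the register file has to mask.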

The balance between caching and registers is tricky.  Caches can never
be big enough, so there is a temptation to replace them with registers
and assume the caches will be implemented off-chip.  This is a fine choice
for some applications, but not for those with an emphasis on low-cost
computing.  In cost-sensitive applications, off-chip memory is strictly
DRAM, and the customer tunes the cost (aka speed) of the DRAM to her
performance needs.  You need 90% of the rated performance?  Use 1
wait-state; only 70%? Use 3 wait states.  Contrast this with the
equivalent message of other architectures: Don't want SRAM cache? Then
you get 50% performance; Don't want two distinct memory systems?
Then use this special memory that's more expensive and slower than what
you might want to use.  Of course, most comp.arch readers probably
don't actually *build* systems out of these chips, so these questions
are at best boring, and probably irrelevant here.
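
For anyone who does want to play with the wait-state tradeoff, here is
a minimal sketch of the kind of model involved.  It assumes each
instruction costs a fixed base number of cycles plus one stall per wait
state for every off-chip reference.  The function name, the parameter
names, and the 0.125 references-per-instruction figure are assumptions
for illustration only, not 960 data.

```python
def relative_performance(wait_states, offchip_refs_per_instruction,
                         base_cpi=1.0):
    """Performance relative to a zero-wait-state memory system.

    offchip_refs_per_instruction is a hypothetical workload parameter:
    the average number of references per instruction that miss on-chip
    storage and stall for the full wait-state count.
    """
    if wait_states < 0 or offchip_refs_per_instruction < 0 or base_cpi <= 0:
        raise ValueError("inputs must be non-negative, base_cpi positive")
    # Cycles per instruction grows linearly with the stalls added by DRAM.
    cpi = base_cpi + offchip_refs_per_instruction * wait_states
    return base_cpi / cpi


if __name__ == "__main__":
    for ws in (0, 1, 2, 3):
        print(ws, round(relative_performance(ws, 0.125), 3))
```

With the assumed 0.125 off-chip references per instruction, this gives
roughly 89% at 1 wait state and 73% at 3, in the neighborhood of the
figures above.  The shape of the curve is the point: performance falls
off gradually as the DRAM gets cheaper, rather than dropping off a cliff.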

S. McGeady
Intel Corp.