Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!oliveb!mipos3!omepd!mcg
From: mcg@mipon2.intel.com
Newsgroups: comp.arch
Subject: Re: Register usage
Message-ID: <m0fU2pe-0001hNC@mipon2.intel.com>
Date: 4 Jun 89 19:08:25 GMT
References: <259@mindlink.UUCP> <25382@ames.arc.nasa.gov> <m0fRx4x-0001fDC@mipon2.intel.com> <25786@amdcad.AMD.COM>
Sender: news@omepd.UUCP
Reply-To: mcg@mipon2.UUCP (Steven McGeady)
Organization: Intel Corp., Hillsboro
Lines: 75

In article <25786@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <m0fRx4x-0001fDC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes:
>| [Re:] huge directly-addressable register files ... in any reasonable
>| implementation, the register file must be multi-ported.  A six-ported register
>| cell is about 5x the size of a single-ported register cell ...
>
>I don't see why this is a reason not to implement large register files. 
>You need to apply transistors where they will do the most good.

This is a truism, a cliche.  The point is that we are, and have always been,
limited in our scope by how many devices could be {effectively, economically,
manufacturably} put on a die.  If one spends 2/3 of one's silicon budget
on a register file, then correspondingly less is available for function
units (e.g. floating-point), caches, and on-chip peripherals.  The question
of whether a large register file will "do the most good" is precisely the
point here, and I have seen no evidence that it is.

>Note
>that processors that attempt to "more fully utilize micro-parallelism"
>also tend to want to have more general-purpose registers available to
>maintain full performance.

As above, I would like to see your experimental evidence that suggests
that more than 64 general registers are useful for a preponderance of
embedded applications (which is, I believe, your target market), or
for any application other than multi-precision scientific code.
We have seen that, given current and projected compiler technology,
an extremely large *addressable* register set is substantially less
useful than a large on-chip cache.  With wide (128-bit) low-latency
transfers from cache to registers overlapped with other operations,
a large register set is not really needed.  Large embedded applications
we have examined show an average of 6 local scalar variables in registers,
with an additional 3-5 temporaries in use.  Additional registers may be
used to cache global variables, and to temporally separate destination
values that could otherwise share registers, thus avoiding scoreboarding
blocks.

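And since a six-ported register cell is about 5x the size of a
single-ported one, one can put roughly 5x as many bits of cache in the
place of the registers that would be added.  The cache therefore makes
even more sense as a repository for values of import that are not
currently in registers.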
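
As a back-of-the-envelope sketch of that trade (not a measurement): the
5x cell-size ratio is the figure quoted above, while the 32-bit register
width, the idea that a cache bit costs about one single-ported cell, and
the function name are assumptions made for illustration.  Tags, decoders
and wiring are ignored.

```python
# Rough area trade-off: multi-ported register cells vs. single-ported
# cache cells.  All constants are illustrative assumptions.

PORT_AREA_RATIO = 5.0    # 6-ported cell area / single-ported cell area
BITS_PER_REGISTER = 32   # assumed register width


def cache_bytes_for_registers(extra_registers,
                              ratio=PORT_AREA_RATIO,
                              bits=BITS_PER_REGISTER):
    """Bytes of single-ported storage that fit in the area taken by
    `extra_registers` multi-ported registers (hypothetical model)."""
    register_bits = extra_registers * bits
    return register_bits * ratio / 8


if __name__ == "__main__":
    # Hypothetical register-file growth steps.
    for n in (32, 64, 128):
        print(n, "registers ~", cache_bytes_for_registers(n), "cache bytes")
```

The model says nothing about whether those cache bytes are *useful*;
that still depends on the workload.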


>| This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
>| (actually 256) architectural registers.  In the current implementation they
>| are (I believe) 3-ported, and even now occupy a large amount of the 29k
>| die space.  I believe that they will run into serious problems if they
>| ever attempt to dispatch and execute multiple general instructions per cycle.
>
>Well, we see no problems in either our 2nd or 3rd generation parts...

Well then, do speak up.  I can't imagine how you will effectively increase
the number of ports in the register file without creating an imbalance
between CPU speed and bus (because of poor register/cache balance).
And I can't imagine how you will lower your CPI without it.  But then,
perhaps I'm just not imaginative enough.

>| The 960, on the other hand, exposes 32 general registers architecturally,
>| but because 16 of these are "local", and saved/restored on call/return
>| to/from architecurally-hidden resources, we can easily move from a cheap
>| (1-ported by 4 sets) register implementation, to a very fast one
>| (6-ported by 8+ sets) in high-performance implementations.
>
>So you *will* be looking at large register files (128+, 6-ported) for
>high performance.

No, not at all.  If you read Glenn Hinton's Compcon paper, you will realize
that, while in the first generation we used a ((16*4)+16) single-ported
register file, the subsequent generation includes a 32-register, 6-ported
file, and that registers spilled by call and restored on return are
flushed to an on-chip cache capable of storing 8 or more of these register
sets.  The registers are flushed across a 128-bit wide dual bus, so
call and return take only 4 clocks each.
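
The arithmetic behind those figures can be sketched as follows.  The 16
local registers and the 128-bit bus width come from the text above; the
32-bit register width and the assumption of one bus-width transfer per
clock are mine, and the function names are made up for the example.

```python
# Sketch of the register-set spill arithmetic described above.

def spill_clocks(registers_per_set=16, register_bits=32, bus_bits=128):
    """Clocks to move one local register set, assuming one full
    bus-width transfer per clock (an assumption, not a spec)."""
    total_bits = registers_per_set * register_bits
    return -(-total_bits // bus_bits)   # ceiling division


def first_gen_registers(local_per_set=16, sets=4, globals_=16):
    """Physical register count of the ((16*4)+16) first-generation file."""
    return local_per_set * sets + globals_


if __name__ == "__main__":
    print("clocks per call or return:", spill_clocks())
    print("first-generation registers:", first_gen_registers())
```

With these numbers, 16 registers of 32 bits are 512 bits, or four 128-bit
transfers, which is consistent with the 4 clocks quoted for call and for
return.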

S. McGeady
Intel Corp.