Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!cs.utexas.edu!sun-barr!sun!imagen!atari!portal!cup.portal.com!bcase
From: bcase@cup.portal.com (Brian bcase Case)
Newsgroups: comp.arch
Subject: Re: Register usage
Message-ID: <18965@cup.portal.com>
Date: 30 May 89 17:32:38 GMT
References: <259@mindlink.UUCP> <25382@ames.arc.nasa.gov> <m0fRx4x-0001fDC@mipon2.intel.com>
Organization: The Portal System (TM)
Lines: 112

>A six-ported register cell is about 5x the size of a single-ported
>register cell or a cache RAM cell.

This could be, depending on technology details, of course.  (The width
of metal dominates, usually?)

>This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
>(actually 256) architectural registers.  In the current implementation they
>are (I believe) 3-ported, and even now occupy a large amount of the 29k
>die space.  I believe that they will run into serious problems if they
>ever attempt to dispatch and execute multiple general instructions per cycle.

I can speak to this point with a little authority.  The register file in the
current implementation is indeed 3-ported.  I must admit that I have had
second thoughts about making the register file so big.  The advantages are
many, but the cost of additional ports is indeed bigger than for other
architectures (boy, the '386 architecture has a leg up here! :-).

The current 29K implementation has about 1/5 of the usable (i.e., non-pad
ring) die area dedicated to the register file using 1.26 (Dave Witt:
what is it really?) micron technology.
At about 1 micron, that will shrink to about 1/9 of the usable area, and at
.8 micron, about 1/11.  Increasing the number of ports to 6, let's say, will
increase the size by about a factor of 1-2/3 (from 3x a single-ported cell to
(using Steve's number) about 5x a single-ported cell).  Thus, at 1 micron,
the register file will use about 18% of the usable die area, and at .8 micron
about 15% of the area.  This is not an insignificant amount of area, but it
is not "too" much in my opinion.  Why?  Because having lots of registers IS
GOOD.  If the 29K spends more area on a 6-ported data storage resource than
other processors, I think it's an advantage as long as it's not taking
"too" much area!  Up to a point, I would rather spend area on a 6-ported
resource than a 1-ported resource (cache).  I guess we are talking about
where is the "point" in "up to a point."
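
A quick sketch of the arithmetic above, for anyone who wants to play with
the numbers.  The area fractions and the 3x/5x cell sizes are the ones
quoted in this post; the names (three_port_share, PORT_SCALE,
six_port_share) are made up for illustration, and the whole thing is
back-of-the-envelope scaling, not a layout estimate.

```python
from fractions import Fraction

# Figures from the post: share of usable die area taken by the
# 3-ported register file at each feature size (in microns).
three_port_share = {
    "1.26": Fraction(1, 5),
    "1.0": Fraction(1, 9),
    "0.8": Fraction(1, 11),
}

# Cell-size assumption from the post: a 3-ported cell is about 3x a
# single-ported cell and a 6-ported cell about 5x, so the file grows by 5/3.
PORT_SCALE = Fraction(5, 3)


def six_port_share(share):
    """Scale a 3-ported area share up to the 6-ported estimate."""
    return share * PORT_SCALE


for node, share in three_port_share.items():
    print(f"{node} micron: 3-ported {float(share):.1%}, "
          f"6-ported {float(six_port_share(share)):.1%}")
```

With these inputs the 6-ported file comes out near 18.5% at 1 micron and
near 15.2% at .8 micron, which is where the "about 18%" and "about 15%"
above come from.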

On the other hand, a 6-ported, 32-register file would be
about (I am guessing by simple scaling)
2% to 5% of the usable die area.  On the 960, some of the resulting
difference of about 13% (18% vs. 5%, or 15% vs. 2%) is used for the
register file "backing store", but not much, say another 3% (I dunno what
it really is).  So, the 960 has a 10% "bonus"
chunk of die area.  What can be done with it?  At some points on the technology
curve, it will have a larger cache than the 29K at the same point.  At some
other points, the 10% die area will "only" result in a smaller die because
10% isn't enough to increase the size of a cache or an FP somethingorother
(but 10% smaller die are cheaper die, depending on yield, greediness, etc.).
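
The "simple scaling" guess can be sketched the same way.  This assumes
register file area scales linearly with register count (ignoring decoders
and other fixed overhead, which is my assumption here, not a measured
fact) and uses the 3% backing-store guess from above.  The names
(small_file_share, bonus_area) are hypothetical.

```python
from fractions import Fraction

# 6-ported, 192-register shares implied by the post's figures
# (3-ported share of 1/9 or 1/11, times the 5/3 port scaling).
six_port_192 = {
    "1.0": Fraction(1, 9) * Fraction(5, 3),
    "0.8": Fraction(1, 11) * Fraction(5, 3),
}

# Assumption: area is proportional to the number of registers.
REG_RATIO = Fraction(32, 192)

# The post's guess for the 960's register "backing store".
BACKING_STORE = Fraction(3, 100)


def small_file_share(big_share):
    """Estimated share of a 6-ported, 32-register file."""
    return big_share * REG_RATIO


def bonus_area(big_share):
    """Die area left over after the small file and its backing store."""
    return big_share - small_file_share(big_share) - BACKING_STORE


for node, big in six_port_192.items():
    print(f"{node} micron: 32-register file {float(small_file_share(big)):.1%}, "
          f"bonus {float(bonus_area(big)):.1%}")
```

Under linear scaling the 32-register file lands inside the 2% to 5% range
guessed above, and the leftover area comes out in the neighborhood of the
10% "bonus", a bit more at 1 micron and a bit less at .8 micron.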

So the 960 seems to have a slight implementation advantage.  What's the real
difference?  I don't know because 10% is a small enough amount that it can
be lost in the "noise" of implementation:  If one guy uses automatic tools
and the other uses full custom design/layout, some difference will result.
Also, just due to "the way things are" the 10% might not be usable.  Some
die are square while some are rectangular because of "the way things are."

I am making this argument in full recognition of an argument about CISCs
that I used to believe:  "CISCs will always have an implementation
disadvantage because of the microcode ROM."  This is bogus for the same
reason:  as the technology improves, the ROM itself shrinks until it can no
longer be seen with the naked eye!  What constitutes either a current RISC
processor or a current CISC processor (the PROCESSOR pipeline not the caches,
TLBs, etc.) would be a very small corner on the die if implemented in the
technology of 1995.

However, what constitutes a current RISC or CISC processor will be
wholly uninteresting in 1995.  For the implementations to come, including
superscalar stuff, the issues will be the complexity of implementation and
the cost (read:  people) of realizing that implementation.  This is where
CISCs, I believe, will falter.  One of the great advantages of RISCs is
that they are conceptually easier to implement.  This effect is compounded
with increasing ambitions for greater performance:  the fewer interactions
between instructions the easier a multi-instruction-per-cycle
implementation will be to construct.  In 1995, there might be another
reactionary simplification movement in computer architecture; maybe current
RISCs are too complex!  Too many special cases!  But I digress...

Size will still matter, but I believe it will be dominated by the many
connections (buses) and small structures required to handle the special
cases and resource interactions, not the larger, regular structures like
register files and caches (although we will still be trying to fit as much
cache as possible).  Thus, the 14-ported 192-register file of the 29K will
still be larger than the 14-ported 32-register files of the 960, MIPS, i860,
etc., but it won't matter because the 29K's register file will be 1% of the
die while the 960's will be 0.1%.  (BTW, the register file will be the center
of the processor, not at one end of the data path as it is now.)
So I believe.

>The 960, on the other hand, exposes 32 general registers architecturally,
>but because 16 of these are "local", and saved/restored on call/return
>to/from architecturally-hidden resources, we can easily move from a cheap
>(1-ported by 4 sets) register implementation, to a very fast one
>(6-ported by 8+ sets) in high-performance implementations.

This indeed gives more flexibility in implementation choices.  The 29K's
register file gives more flexibility in choices for using the register file.
Only time will tell if more advantage is gained from having implementation
choices or from use choices.  If the belief that business issues dominate
in the end is correct, maybe the business issue of cost is more important.

>The cut line is different on every architecture - 32 is sufficient on the
>960, but I am not disagreeing with Wall's estimate of 64.  Certainly
>in floating-point intensive scientific applications dominated by
>double-precision arithmetic in loops, more registers are needed.  But
>substantially more than 64 seems to limit architectural flexibility
>quite severely.

That should be "more than 64 seems to limit *implementation* flexibility."
The 29K has more architectural flexibility: the register file can be used
as a stack cache, a flat pool of 192 registers, or a few pools of a
smaller number of registers.  Is this important?  I dunno yet.

These are just a few of my opinions mixed up with some pseudo-facts.  Don't
believe any of them!

"I did it my way."  - Sinatra.