Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!oliveb!mipos3!omepd!mcg
From: mcg@mipon2.intel.com
Newsgroups: comp.arch
Subject: Re: Register usage
Message-ID: <m0fRx4x-0001fDC@mipon2.intel.com>
Date: 30 May 89 00:35:34 GMT
References: <259@mindlink.UUCP> <25382@ames.arc.nasa.gov>
Sender: news@omepd.UUCP
Reply-To: mcg@mipon2.UUCP (Steven McGeady)
Organization: Intel Corp., Hillsboro
Lines: 43

In article <25382@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>In article <259@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
>
>>say).  In the extreme case of the computer someone mentioned that had 256
>>registers, 
>
>I agree that 256 registers were probably too many, because the compiler had
>trouble using more than about 60 typically.
>
>I am not sure what you mean by "slow programs down".  Certainly, RISC machines
>have bigger code than compact code CISC's like VAX and NS32000.  But the RISC
>machines have generally been significantly faster when implemented in the 
>same technology than the corresponding CISC machines.

One thing that no one has yet pointed out is that a reason not to implement
huge directly-addressable register files is that, in any reasonable
implementation, the register file must be multi-ported.  A six-ported register
cell is about 5x the size of a single-ported register cell or a cache RAM
cell.  To more fully utilize micro-parallelism in an architecture, more
sources and results need to be fetched from the register file simultaneously,
thus the additional ports.
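
To make the arithmetic concrete, here is a rough sketch in Python.  It
assumes each instruction reads two sources and writes one result, so the
port count scales with issue width, and it uses only the one cell-size
figure quoted above (6-ported is about 5x single-ported).  The function
names and the 32-bit register width are my own choices for illustration,
not measurements of any real part.

```python
# Back-of-the-envelope register file cost model (illustrative assumptions).
READS_PER_INSTR = 2    # assumption: two source operands per instruction
WRITES_PER_INSTR = 1   # assumption: one result per instruction
SIX_PORT_CELL_FACTOR = 5.0  # from the text: 6-ported cell ~5x a 1-ported cell


def ports_needed(issue_width):
    """Register file ports needed to feed `issue_width` instructions/cycle."""
    return issue_width * (READS_PER_INSTR + WRITES_PER_INSTR)


def file_area(num_registers, bits_per_register, cell_factor):
    """Area in units of one single-ported cell (wiring and decoders ignored)."""
    return num_registers * bits_per_register * cell_factor


if __name__ == "__main__":
    print("ports for single issue:", ports_needed(1))
    print("ports for dual issue:  ", ports_needed(2))
    for regs in (32, 192):
        area = file_area(regs, 32, SIX_PORT_CELL_FACTOR)
        print(regs, "registers, 6-ported:", area, "single-port cell units")
```

Under these assumptions the cost of going to six ports is paid on every
directly-addressable register, so a 192-register file pays it six times
over compared with a 32-register one.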

This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
(actually 256) architectural registers.  In the current implementation the
registers are (I believe) 3-ported, and even now occupy a large amount of
the 29k die space.  I believe that AMD will run into serious problems if
they ever attempt to dispatch and execute multiple general instructions per
cycle.

The 960, on the other hand, exposes 32 general registers architecturally,
but because 16 of these are "local", and saved/restored on call/return
to/from architecturally-hidden resources, we can easily move from a cheap
(1-ported by 4 sets) register implementation to a very fast one
(6-ported by 8+ sets) in high-performance implementations.
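
A toy model of the idea, in Python, for anyone who wants to play with the
numbers.  It treats the hidden resource as a fixed number of on-chip local
register sets, allocates one per call, and spills the oldest set to memory
when none is free.  The class name, the spill-oldest policy, and the
counters are assumptions for illustration only; this is not a description
of how any particular 960 implementation manages its sets.

```python
class LocalRegisterCache:
    """Toy model: local register sets saved/restored on call/return."""

    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.resident = 1    # frames held on chip (starts with current frame)
        self.in_memory = 0   # frames spilled out to memory
        self.spills = 0
        self.fills = 0

    def call(self):
        # Assumed policy: if every set is in use, spill the oldest frame.
        if self.resident == self.num_sets:
            self.resident -= 1
            self.in_memory += 1
            self.spills += 1
        self.resident += 1   # the callee gets a fresh local set

    def ret(self):
        if self.resident == 1 and self.in_memory == 0:
            raise RuntimeError("return from outermost frame")
        self.resident -= 1   # discard the callee's local set
        if self.resident == 0:
            # The caller's frame was spilled earlier; bring it back.
            self.in_memory -= 1
            self.resident = 1
            self.fills += 1


def run(num_sets, depth):
    """Call `depth` levels deep, then return all the way; count traffic."""
    cache = LocalRegisterCache(num_sets)
    for _ in range(depth):
        cache.call()
    for _ in range(depth):
        cache.ret()
    return cache.spills, cache.fills
```

The point of the model is that the number of sets is a knob the
implementation can turn without the program seeing it: run(4, 10) and
run(8, 10) execute the same call sequence and differ only in how much
spill/fill traffic they generate.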

The cut line is different on every architecture - 32 is sufficient on the
960, but I am not disagreeing with Wall's estimate of 64.  Certainly
in floating-point intensive scientific applications dominated by
double-precision arithmetic in loops, more registers are needed.  But
substantially more than 64 seems to limit architectural flexibility
quite severely.

S. McGeady
Intel Corp.