Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!lll-winken!lll-lcc!pyramid!voder!apple!bcase
From: bcase@apple.UUCP (Brian Case)
Newsgroups: comp.arch
Subject: Re: register windows
Message-ID: <6681@apple.UUCP>
Date: Mon, 9-Nov-87 13:28:08 EST
Article-I.D.: apple.6681
Posted: Mon Nov  9 13:28:08 1987
Date-Received: Wed, 11-Nov-87 07:10:19 EST
References: <230@usl-pc.UUCP>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc., Cupertino, USA
Lines: 69

In article <230@usl-pc.UUCP> jpdres10@usl-pc.UUCP (Green Eric Lee) writes:
>I've been following this conversation for some time. I wonder, how
>much does the adder in the register path (to add the register stack
>pointer to the instruction's desired register) slow things down in the
>AMD29000? It might also seem like if you have register windows, you
>have the same thing sitting there in the middle of the address path
>between instruction latch and register addressing, but one scheme I've
>read about is to make your register windows out of shift-registers,
>i.e. no adding, as far as the machine is concerned, you just have your
>16 registers out there or whatever, and when a procedure call is made,
>or a return made, shift the current registers down or up (with some
>provisions for stack overflow). And, of course, Plain Old Registers
>have no problem at all with something out there in the register
>addressing path, except, of course, the decode tree... it seems
>offhand that the MIPS approach, with lots of registers and a very
>smart compiler capable of allocating them to procedures as needed,
>might offer some potential future speed advantages over traditional
>(non-shift-register) register windows or AMD's "stack window"
>approach, unless some very heavy pipelining is employed.

Well, that's a lot of stuff for one posting.  First off, the adder in
the register addressing path costs little in the current implementation
beyond silicon area.  The reason:  the chip does register write before read,
and the write must complete before reads can be done so that an extra level
of forwarding (bypassing is another term often used) can be avoided.  So,
while the write is being done, we might as well do something useful like
an address add.  Machines like Berkeley RISC II (and I suspect SUN SPARC)
simply integrate the "add" into the decode tree.  This is somewhat less
expensive, but the idea is the same.  Ok, so the question comes up (at least
it did at AMD):  "What about other implementations, e.g. ECL?"  Well, it
is quite likely that a 4-stage pipeline is still going to be desirable in
other technologies.  It is also likely that the elimination of a stage of
forwarding with write-before-read is desirable.  At least with the kind of
ECL we were considering at AMD, things should scale reasonably well.  Sure,
things aren't going to scale exactly, but we felt the large register file
was worth the risk of a slight mismatch in pipestage lengths.
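
To make the "address add" concrete, here is a minimal sketch of
stack-pointer-relative register addressing.  It is an illustration only,
not the actual 29000 logic:  the function name, the 128-entry local file,
the use of the high bit of an 8-bit register field to select a local
register, and treating the stack pointer as a plain register index are
all simplifying assumptions of mine.

```python
NUM_LOCAL = 128    # assumed size of the windowed (local) register file
LOCAL_FLAG = 0x80  # assumed: high bit of the 8-bit field means "local"


def physical_register(field: int, stack_pointer: int) -> int:
    """Map an instruction's register field to a physical register index.

    Local registers are addressed relative to the stack pointer, so the
    field offset is added to it and wrapped to the size of the local file.
    Global registers are used as-is, with no add.
    """
    if field & LOCAL_FLAG:
        offset = field & (NUM_LOCAL - 1)
        return LOCAL_FLAG | ((offset + stack_pointer) % NUM_LOCAL)
    return field
```

With a stack pointer of 10, local register 1 (field 0x81) lands on
physical register 139, and the sum wraps around at the end of the local
file.  The add is only 7 bits wide, which is why it can hide under the
register write or be folded into the decode tree.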

As for the viability of flat register files, there is some evidence beyond
whatever you consider MIPS, et al. to have provided.  David Wall of DECWRL
constructed what is basically an optimizing linker for the experimental
Titan machine.  This linker uses complete knowledge of the program object
code, and can even make use of fed-back run-time info, to construct a very
good register allocation.  The effect is somewhat like register windows.
Unfortunately, this technique is based on static analysis, even when run-time
info is fed back, and so doesn't do the same job as register windows.
However, in practice, it is worth something.  The Titan has 63 registers.
Wall didn't say how well the technique would work with fewer registers.
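
The window-like effect can be illustrated with a toy sketch.  This is
not Wall's algorithm, just the underlying idea as I understand it:
procedures that can never be active at the same time may share
registers, while a caller gets registers above everything its callees
use.  The function name, the call-graph representation, and the
per-procedure register counts are all hypothetical, and the sketch
assumes an acyclic call graph (no recursion, no indirect calls).

```python
from functools import lru_cache


def assign_register_bases(calls, needs):
    """Return {procedure: first register index} for an acyclic call graph.

    calls: dict mapping a procedure to the list of procedures it calls.
    needs: dict mapping a procedure to the number of registers it wants.
    """
    @lru_cache(maxsize=None)
    def base(proc):
        # Start just above the highest register any callee chain uses;
        # leaf procedures start at register 0.
        return max((base(c) + needs[c] for c in calls.get(proc, [])),
                   default=0)

    return {proc: base(proc) for proc in needs}


calls = {"main": ["f", "g"], "f": ["h"], "g": [], "h": []}
needs = {"main": 4, "f": 3, "g": 5, "h": 2}
bases = assign_register_bases(calls, needs)
```

Here f and g overlap (both sit below register 5) because neither calls
the other, while main, f, and h are on one call chain and get disjoint
registers.  No saves or restores are needed across these calls, which
is the part that resembles windows; what a static assignment cannot do
is adapt to the call depth actually reached at run time.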

Anyway, here is the reference:

Wall, D. W., "Global Register Allocation At Link Time," ACM SIGPLAN Compiler
Construction Conference, Palo Alto, June 1986.

Sorry, I don't have the Vol and No info since I got the paper directly
from David.

Ok, so as for future speed advantages:  yes, those with simple register
files might see some.  However, for the Am29000,
the critical paths were quite balanced (Dave Witt, are you out there?)
with, I believe, the TLB and/or instruction cache being the limiting
factor.  Next came the ALU, and then the register file.  Unless you want
to do things like spread the ALU cost over two pipestages (possible to do),
I don't think the register file is going to be the limiting factor.

First, I don't mean to imply that we considered all other implementation
technologies nor that we considered any very seriously.  Second, some of
what I said about the 29000 implementation may not be completely correct,
but the gist is correct.  Third, what do other people have to say?
Probably not much, since talking about this would be tantamount to disclosing
future plans.  Oh well.