Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!lll-winken!lll-lcc!pyramid!voder!apple!bcase
From: bcase@apple.UUCP (Brian Case)
Newsgroups: comp.arch
Subject: Re: register windows
Message-ID: <6681@apple.UUCP>
Date: Mon, 9-Nov-87 13:28:08 EST
Article-I.D.: apple.6681
Posted: Mon Nov 9 13:28:08 1987
Date-Received: Wed, 11-Nov-87 07:10:19 EST
References: <230@usl-pc.UUCP>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc., Cupertino, USA
Lines: 69

In article <230@usl-pc.UUCP> jpdres10@usl-pc.UUCP (Green Eric Lee) writes:
>I've been following this conversation for some time. I wonder, how
>much does the adder in the register path (to add the register stack
>pointer to the instruction's desired register) slow things down in the
>AMD29000? It might also seem like if you have register windows, you
>have the same thing sitting there in the middle of the address path
>between instruction latch and register addressing, but one scheme I've
>read about is to make your register windows out of shift-registers,
>i.e. no adding, as far as the machine is concerned, you just have your
>16 registers out there or whatever, and when a procedure call is made,
>or a return made, shift the current registers down or up (with some
>provisions for stack overflow). And, of course, Plain Old Registers
>have no problem at all with something out there in the register
>addressing path, except, of course, the decode tree... it seems
>offhand that the MIPS approach, with lots of registers and a very
>smart compiler capable of allocating them to procedures as needed,
>might offer some potential future speed advantages over traditional
>(non-shift-register) register windows or AMD's "stack window"
>approach, unless some very heavy pipelining is employed.

Well, that's a lot of stuff for one posting.
First off, in the current implementation the adder in the register addressing path doesn't cost much other than silicon area. The reason: the chip does register write before read, and the write must complete before reads can be done so that an extra level of forwarding (bypassing is another term often used) can be avoided. So, while the write is being done, we might as well do something useful like an address add. Machines like Berkeley RISC II (and I suspect SUN SPARC) simply integrate the "add" into the decode tree. This is somewhat less expensive, but the idea is the same.

Ok, so the question comes up (at least it did at AMD): "What about other implementations, e.g. ECL?" Well, it is quite likely that a 4-stage pipeline is still going to be desirable in other technologies. It is also likely that eliminating a stage of forwarding with write-before-read will still be desirable. At least with the kind of ECL we were considering at AMD, things should scale reasonably well. Sure, things aren't going to scale exactly, but we felt the large register file was worth the risk of a slight mismatch in pipestage lengths.

As for the viability of flat register files, there is some evidence beyond whatever you consider MIPS, et al. to have provided. David Wall of DECWRL constructed what is basically an optimizing linker for the experimental Titan machine. This linker uses complete knowledge of the program object code, and can even make use of fed-back run-time info, to construct a very good register allocation. The effect is somewhat like register windows. Unfortunately, this technique is based on static analysis, even when run-time info is fed back, and so doesn't do the same job as register windows. However, in practice, it is worth something. The Titan has 63 registers. Wall didn't say how well the technique would work with fewer registers. Anyway, here is the reference:

Wall, D.
W., "Global Register Allocation At Link Time," ACM SIGPLAN Compiler Construction Conference, Palo Alto, June 1986.

Sorry, I don't have the Vol and No info since I got the paper directly from David.

Ok, so to address future speed advantages: yes, machines with simple register files might gain some speed. However, for the Am29000, the critical paths were quite balanced (Dave Witt, are you out there?) with, I believe, the TLB and/or instruction cache being the limiting factor. Next came the ALU, and then the register file. Unless you want to do things like spread the ALU cost over two pipestages (possible to do), I don't think the register file is going to be the limiting factor.

First, I don't mean to imply that we considered all other implementation technologies nor that we considered any very seriously. Second, some of what I said about the 29000 implementation may not be completely correct, but the gist is correct. Third, what do other people have to say? Probably not much, since talking about this would be tantamount to disclosing future plans. Oh welwor
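
For anyone who wants the register-path add from the top of this posting spelled out, here is a rough sketch of the difference between flat addressing and "stack window" addressing. It is only a toy model under my own assumptions, not the actual 29000 logic: the function names are made up, the 128-entry size is just an illustrative parameter, and the direction in which the pointer moves on a call is an arbitrary choice.

```python
# Toy model of register addressing. NOT the real Am29000 datapath;
# all names and sizes here are illustrative assumptions.
LOCAL_REGS = 128  # assumed size of the stack-addressed part of the file


def flat_address(reg):
    """Plain Old Registers: the instruction field is the physical index."""
    return reg


def stack_window_address(reg, stack_ptr):
    """Stack window: add the stack pointer to the instruction's register
    number, wrapping around the end of the file. This is the add that
    sits in the register addressing path."""
    return (stack_ptr + reg) % LOCAL_REGS


def call(stack_ptr, frame_size):
    """Procedure call: allocate a frame by moving the pointer down
    (the direction is an assumption of this sketch)."""
    return (stack_ptr - frame_size) % LOCAL_REGS


def ret(stack_ptr, frame_size):
    """Procedure return: release the frame by moving the pointer back."""
    return (stack_ptr + frame_size) % LOCAL_REGS


if __name__ == "__main__":
    sp = 0
    print(flat_address(3), stack_window_address(3, sp))
    sp = call(sp, 16)                       # callee takes a 16-register frame
    print(sp, stack_window_address(3, sp))  # same register number, new slot
    sp = ret(sp, 16)
    print(sp)
```

The point is just that the same register number in the instruction lands on a different physical slot after a call, and the only hardware needed to get there is one small modular add, which is what can be hidden under the register write.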