Path: utzoo!utgpu!watmath!clyde!mcdchg!chinet!att!osu-cis!tut.cis.ohio-state.edu!bloom-beacon!bu-cs!purdue!decwrl!sun!pitstop!sundc!seismo!uunet!portal!cup.portal.com!bcase
From: bcase@cup.portal.com (Brian bcase Case)
Newsgroups: comp.arch
Subject: Re: Longer load/store because of register windows
Message-ID: <10604@cup.portal.com>
Date: 28 Oct 88 19:22:53 GMT
References: <156@gloom.UUCP> <310@lynx.zyx.SE> <332@pvab.UUCP> <15964@agate
Organization: The Portal System (TM)
Lines: 75

>Having a multiple-window register file, or more precisely, having many
>registers, slows down the processor cycle. With a longer cycle the
>load/store accesses become slower. There are two reasons: 1) for a large
>register file, let's say 128 registers, the decoding of the registers
>addresses is longer (more bits to decode, even if you use partial decoding
>there is still a penalty), 2) the data bus is longer because it has to go
>over so many registers. A longer data bus implies larger capacitance and
>longer discharge time, thus longer processor cycle.

This is true in theory.  However, there are two effects that prevent a
(within reason) large reg. file from slowing down the basic cycle:

Circuit design can solve some speed problems; it's just a matter of
spending the power budget.  It is possible to make a register file that
reads and writes *at the same time*.  But it is larger and more wasteful.
Yes, decoding time and capacitive effects can be a problem.  But even
for the 29K's reg file, the access time is something like 10ns, and
that's good old 1.25 micron technology.

However, no one designs a reg. file that reads and writes at the same
time because it isn't necessary; the register file isn't the critical
path.  Things like the cache tag access->compare->set_select and
ALU_subtract->condition_code (or its equivalent) are the harder things.
Again, clever circuit design applies, but what usually happens is that
a compromise is reached so that almost everything "is the critical path."
If you think about it, you see that anything else would be a poor,
unbalanced design, and the circuit designers would get fired!

>You can play some tricks to get around those drawbacks,
>for example the Am29000 uses overlapping to avoid the penalty caused by
>the decoding.

The 29k does no more overlapping than the next guy, to my knowledge.
BUT, I am not one of the circuit designers, so I might be wrong.

>The Intel 80960 uses a cache for local register sets.
>I haven't seen the layout :-), but it seems like the sets are separated
>in a way that the data bus is not lengthened.

Again, the lengthening of the data bus is a concern, but need not be a
problem.  In a reasonable architecture, a pipeline register sits right
after the register file and right after the ALU.  The lengthening of
the data bus is important only for those internal bus operations that
have to traverse the entire length of the processor bit cells.  Very few
operations have to do this (maybe things involving indirect jumps or
something; the PC section might be at the end opposite the reg. file in
the bit cell), and, again, you just have to have mega drivers for them.
I don't mean to say this is *free*, but it is possible, and so a large
register file does not necessarily slow the processor cycle.

>So the question is: Is it clever to invest in a large register file with
>windows or is it better to use the silicon for other circuitry?
>The answer depends on how good your compiler people are!

Well, the windows question is a hard one, so I won't even try that one.
But as for *lots* of registers:  Registers have three times the
bandwidth and much less latency than the next level in the memory
hierarchy.  Maybe I can't think of what to do with them today (actually,
I can think of several things), but putting a critical resource in a
register instead of a cache makes a BIG difference on almost any
machine, certainly on simple ones.
The 29k UNIX implementations keep operating system goodies in protected
global registers while user code is running.  So do some CISC
processors, but in them you have no choice about what gets put there;
it's defined by the architecture.  Lots of registers give you a powerful
resource that you can use any way you want!  When 29k compilers get the
sort of capability found in the DEC Titan compilers (wishful thinking?),
the ability to do universal register allocation, global registers will
be useful for keeping a program's global scalars in registers.  This
will probably necessitate implementing the "missing" 64 global registers
because the greedy OS guys have already hoarded most of the 64 existing
globals.  :-)
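
To put rough numbers on the critical-path point made earlier in this
post, here is a minimal sketch in Python.  It assumes the clock period
is simply the delay of the slowest stage.  Only the 10ns register file
figure comes from this post; the other two delays and all of the names
are made up for illustration and are not measurements of any real chip.

```python
# Hypothetical stage delays in nanoseconds.  Only the 10 ns register-file
# figure is taken from the post; the other two values are invented.
STAGE_DELAYS_NS = {
    "register_file_read": 10.0,
    "cache_tag_compare_select": 18.0,   # assumed
    "alu_subtract_condition": 17.0,     # assumed
}


def cycle_time(delays):
    """Clock period under the assumption that the slowest stage sets it."""
    return max(delays.values())


def with_slower_register_file(delays, extra_ns):
    """Return a copy of `delays` with the register file slowed by extra_ns."""
    slowed = dict(delays)
    slowed["register_file_read"] += extra_ns
    return slowed


if __name__ == "__main__":
    base = cycle_time(STAGE_DELAYS_NS)
    slack = base - STAGE_DELAYS_NS["register_file_read"]
    print(f"base cycle: {base} ns, register-file slack: {slack} ns")
    for extra in (5.0, 10.0):
        slowed = with_slower_register_file(STAGE_DELAYS_NS, extra)
        print(f"+{extra} ns on the register file -> cycle {cycle_time(slowed)} ns")
```

With these assumed numbers the register file has 8ns of slack, so a
bigger, slower file costs nothing in cycle time until the extra delay
exceeds that slack and the register file itself becomes the critical
path.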