Path: utzoo!attcan!uunet!husc6!mailrus!ames!oliveb!pyramid!prls!mips!mash From: mash@mips.COM (John Mashey) Newsgroups: comp.arch Subject: Re: RISC machines and scoreboarding Message-ID: <2438@winchester.mips.COM> Date: 19 Jun 88 05:43:10 GMT References: <1082@nud.UUCP> Reply-To: mash@winchester.UUCP (John Mashey) Organization: MIPS Computer Systems, Sunnyvale, CA Lines: 187 In article <1082@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes: > On RISC processors without a scoreboard*, how are the results of memory >references guaranteed to be available before they are used in subsequent >computations?.... This was discussed a while back. Sounds like it's time to do it again. NOTE: caveats: there are certainly architectures for which some ofthe following doesn't fit. It is intended to be mostly applicable to "mainline" RISC micros. >2) If the compiler/assembly writer is required to wait a certain number >of ticks after a load before using the results of the load, how is the >required amount of delay determined? Can you assume you always get a >cache hit (I doubt it)?.... Following are: 1) How the MIPS R2000/R3000 does it. 2) Some thoughts on when scoreboarding does/doesn't make sense, and why. HOW THE R2000/R3000 WORK 1) A load requires one cycle to initiate, plus one "architectural" cycle of latency, which must be covered by the software, i.e., it must insert a nop if it can't find anything useful. 2) If we ever built a machine in which the primary cache had 2 or more cycles of latency, the software would be responsible for the 1st cycle of latency, and the hardware responsible for cycles 2,...n. (This preserves the ability to run the existing object code unchanged.) However, given the domain of problems we worry about, this is probably irrelevant, as we'll probably never build a system whose first-level cache has more than 1 cycle of latency, given the performance impacts. Of course, it's hard to build a machine that has a zero-cycle load latency :-) I.e., there's a reason for having exactly 1 cycle of latency, not 2 or 0. 3) A lot of statistics say: a) Reorganizers get to fill 70-80% of load delays in 1-cycle latency machines. b) Roerganizers get to fill (as I recall, it's been awhile) 30% of delays in 2-cycle latency machines. c) After that, it gets bad (10% or less). [What does this mean? ans: if you load something, you expect to use it pretty soon.] 4) All of that assumes a cache-hit (which is, of course, the most frequent case). If there is a cache-miss, the main pipeline freezes while the cache refill is done: a) One or more words of data are brought into the D-cache. b) The stalled instruction resumes after the fixup. Thus, the software could care less how far away memory is, whether or not DRAM refresh is running, etc. 5) There are split I & D-caches, and a 1-cycle latency branch is used, with software also being required to fill the branch delay slot. Approximately similar numbers (1 slot rather than 0 or 2) apply as in the load case. 6) As it happens, when I said the main pipeline freezes, what I glossed over was that there are numerous independent functional units: a) The main integer unit (i.e., that really knows where the PC is), including all loads/stores. b) Integer multiplier/divider c) FP adder d) FP multiplier e) FP divider The results of b) - e) are interlocked, i.e., if you attempt to use the result of b) - e), and it isn't ready, it is interlocked by the hardware, rather than software. You can consider this "partial scoreboarding", although we don't particularly call it that, as the main pipeline does completely freeze, rather than trying to issue the following instruction. 7) Now, why should it be that we interlock some things, and not others? a) Note that the 2 things that are NOT interlocked (load and branch): Have small (1-cycle delays) that are unlikely to be shortened, and are undesirable to lengthen Have about a 70% chance of having a 1-cycle delay slot filled by software. b) Note that the things that ARE interlocked: Are multiple cycle operations (DP ADD = 2, rest are longer). Are operations whose performance might well change in a future version (i.e., you might throw more silicon at DP MULT to improve it from 5 cycles, etc) Are generally difficult for compilers to handle completely, for both ofthe above reasons. SCOREBOARDING 1) If you look at where scoreboarding came from (I remember it first from CDC 6600 (see Bell & Newell book, for example), but there may be earlier ones), you had machines with: a) Many independent functional units b) Many multiple-cycle operations (integer add was 3 cycles in 6600, for example, and the FP operations were of course longer). c) Multiple memory pipes, with a long latency to memory, and no caches. 2) Note that RISC micros probably have a), but normally, only the FP ops, and maybe a few others have long multi-cycle operations, and they sure don't have multiple memory pipes, and they often have caches. 3) Whether or not a RISC machine has no scoreboarding, some, or complete, it had better be designed with a pipeline reorganizer in mind, because it's absolutely silly not to be using one. With one exception (in our case), a scoreboard can NEVER do any better than a reorganizer. I'll show the exception later, but here is an example why I say this. Consider: a = b + c; d = 0; 1 lw r1,b 2 lw r2,c 3 add r3,r2,r1 4 sw r3,a 5 sw zero,d On a machine with a simple load-interlock, and 1-cycle-latency loads, the machine would stall for 1 cycle before instruction 3. Suppose you do more complex scoreboarding, i.e., you continue attempting to issue instructions? Then, you might do 1, 2, try 3 and save it, and then either come back and do 3 again (probably) or go on to 4 and discover that it stalls also. If one looks at integer code sequences, one discovers that it is hard to discover many things that don't quickly depend on something else (barring, of course, Multiflow-style compiler technology and hardware designs not currently feasible in VLSI micros). Now, the point is that no simple scoreboarding hardware is going to be able to figure out that the best sequence is: a = b + c; d = 0; 1 lw r1,b 2 lw r2,c 5 sw zero,d 3 add r3,r2,r1 4 sw r3,a which ALWAYS wins (given the memory system described). Note that a 2-cycle latency adds 1 cycle more of delay, even with the reorganized code. Also, note that if you're willing to add mucho more hardware, you can lookahead further (really, keeping 1 independent PC or pipeline for each extra instruction, but compilers can always lookahead further). Similar things go on for floating-point, i.e., no matter how much scoreboarding you do, the compiler MUST do some code-spreading or you will lose performance. Note also that streamlined RISC micros usually do 2 register reads and 1 write/cycle, it's hard to get anything else sque As one last example, one of the MIPers here once wrote a reorganizer for a machine architecture that was fully-interlocked, and improved the performance at least 5% on some benchmarks. SUMMARY OF THIS PART: any pipelined machine ought to be done expecting to have a good reorganizer, even if fully scoreboarded. The exception I mentioned: function(x) => lw r2,x jal function nop We can't move the lw to cover the jal-delay slot, because we're not sure the function's first instruction doesn't use it. It turns out that the number of jal-delay slots left unfilled is fairly low, and the fraction that might have been filled by a load is even lower, as there is almost always some argument-gathering that can be moved into the delay slot. (Most of the jal-delay slots come from code like this: if (something) func1(); else func2(); of course, one can turn on a linker-optimizer that looks at the targets of jal's and copies the target into the delay slot if it can. Anyway, I think the bottom-line is that full scoreboarding: Is not very relevant to the integer part of the CPU. Might be useful for long multi-cycle operations (such as FP), maybe (although Earl Killian has a long analysis that argues otherwise that I don't have time to condense & post; maybe Earl will), Might be helpful for compilers that can't do code scheduling, even though it is silly to do high-performance pipelined RISCs without having a good reorganizer. -- -john mashey DISCLAIMER: UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086