Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!decwrl!sun-barr!apple!voder!pyramid!prls!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Register Scoreboarding
Message-ID: <19661@winchester.mips.COM>
Date: 13 May 89 19:18:08 GMT
References: <24821@lll-winken.LLNL.GOV> <3288@orca.WV.TEK.COM> <19463@winchester.mips.COM> <170@dg.dg.com>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 135

In article <170@dg.dg.com> mpogue@dg.UUCP (Mike Pogue) writes:
>In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>
>>Bottom line: the part of scoreboarding that lets you continue beyond a
>> load-cache-miss gets you 1-2%, in an R3000-like architecture,
>
> John,
>   I think you have missed the point here.  The performance improvement
>due to register scoreboarding is only of minor interest.  The real point
>from an architectural point of view is that all binaries continue to run
>in a predictable way, even when implementation details change.

I MUST NOT BE COMMUNICATING RIGHT, OR ELSE PEOPLE AREN'T HEARING WHAT I'M
SAYING, OR THE NET IS LOSING POSTINGS:

My posting: <19162@winchester.mips.COM> included:

-In particular, Motorola & co are persistent in claiming that the world
-will fall apart for MIPS if the timings of the floating-point operations
-change, despite the fact that it has clearly been stated many times
-that we have complete interlocking on ALL of the multi-cycle operations.
-Really, the only things that don't have interlocking are loads and
-equivalents (i.e., move-from-coprocessor), and they all have a 1-cycle
-delay that is predictable to the compilers.  The (Without) in
-Microprocessor (Without) Interlocking Pipeline Stages, which may have
-been appropriate for the Stanford MIPS, is pretty much irrelevant
-when it comes to MIPSco MIPS.
-As I've said here before, if we ended up with loads that had another
-cycle of latency, we'd build a machine with an interlock on the extra
-cycle.  If we decided to put in load interlocks, that would
-be upward-compatible, although we'd likely compile 3rd-party executables
-with R3000-style forever.  (Of course, if we did add load interlocks at
-some point, and if there got to be more of those machines around, at some
-point maybe we'd start advising people to compile for that, and then do
-a reverse-translate on R3000-machines!)
-If the timings of floating-point operations
-are different (and they are) in forthcoming products, the existing object
-code works fine.  However, even with completely interlocked and/or
-scoreboarded code, you STILL want the compilers to be as aggressive
-as possible.  Fortunately, the way most of these things work, if you
-try to optimize for the version with the longest latencies, it usually
-works pretty well for ones with shorter latencies as well.  To see this,
-suppose you had a 5-cycle FP multiply, and so you'd been generating code
-that tried to issue 4 more instructions before using the result of the
-multiply.  IF the multiply expanded to 10 cycles, the compiler folks
-would try to work harder and find more things to do while the multiply
-was running, which wouldn't usually hurt the machine with the 5-cycle
-multiply.  It's just a question of the number of stall cycles, and
-it's obvious that it almost always pays to spread the computation of a
-multi-cycle result and the use of that result as far apart as possible.
-
-This, of course, is not remotely a new issue: any of the long-lived
-computer product lines has faced this, especially those that
-cover a range of implementation technologies, such as VAXen or S/360s.
-The solutions are the same, except that the simplicity of RISC-style
-instructions makes it marginally easier to manipulate object code.
-Our experience with these methods tends to make us more willing to
-consider object code translation as one more trick to use when it makes
-sense, and it's really not that weird once you get used to it.

So now, I'll try again.  Here are some assertions that have appeared:

ASSERTION 1: scoreboarding lets you modify latencies for different operations
while using the same object code.

ASSERTION 2: scoreboarding is the ONLY way to do this; if a fully-interlocked
machine doesn't use scoreboarding, somehow, it is deemed impossible for
the object of ASSERTION 1 to be accomplished.

ASSERTION 3: well, if interlocking works after all, then scoreboarding is
better for performance reasons.
	3A: in supercomputers
	3B: in current microprocessors
------
ASSERTION 1 is clearly true.

ASSERTION 2 is silly; there are too many machines implemented without
scoreboards, but with interlocks, that let you modify the latencies.
I'll say, one more time, there is exactly one kind of user-visible latency
in a MIPS R3000 that is not interlocked, and that's the
{load, move-from-coprocessor}, and it's one cycle that needs to be covered
by the compiler.  The likely-to-be-variable-and/or-long latencies
[FP, int mul/div] are all fully-interlocked, even though the compilers
work hard to do their best to schedule the pipelines well.
If we did an implementation where our simulations claimed that out-of-order
instruction initiation was a sensible overall design choice, then we
might use scoreboarding as an implementation choice.  My back-of-the-envelope
analysis showed why we weren't yet overly excited about that.

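[To make the stall-cycle argument concrete, here is a toy model, not taken
from any MIPS tool: a minimal Python sketch, assuming a single-issue,
in-order pipeline whose interlock simply holds the issue stage until every
source register is ready.  The function name issue_cycles, the register
names, and the latency tables FAST and SLOW are all hypothetical.  It runs
the same instruction lists against a 5-cycle and a 10-cycle multiply.]

```python
# Toy model (hypothetical): in-order, one instruction issued per cycle,
# with a hardware interlock that delays issue until all sources are ready.

def issue_cycles(program, latency):
    """Return (cycles needed to issue every instruction, stall cycles).

    program: list of (op, dest, sources) tuples
    latency: dict mapping op -> cycles from issue until the result is usable
    """
    ready = {}   # register -> first cycle in which its value can be read
    cycle = 0    # earliest cycle in which the next instruction may issue
    stalls = 0
    for op, dest, sources in program:
        # Interlock: wait for the slowest outstanding source operand.
        issue = max([cycle] + [ready.get(r, 0) for r in sources])
        stalls += issue - cycle
        ready[dest] = issue + latency[op]
        cycle = issue + 1
    return cycle, stalls


# Assumed latencies: only the multiply differs between the two machines.
FAST = {"mul": 5, "fadd": 2, "add": 1}
SLOW = {"mul": 10, "fadd": 2, "add": 1}

MUL = ("mul", "f2", ("f0", "f1"))
USE = ("fadd", "f3", ("f2", "f4"))                     # consumes the product
FILLER = [("add", f"r{i}", ()) for i in range(9)]      # independent work

naive = [MUL, USE] + FILLER        # result used immediately
scheduled = [MUL] + FILLER + [USE] # use pushed as far away as possible

for name, prog in (("naive", naive), ("scheduled", scheduled)):
    for label, lat in (("5-cycle mul", FAST), ("10-cycle mul", SLOW)):
        print(name, label, issue_cycles(prog, lat))
```

[Under these assumptions the naive ordering needs 15 issue cycles with the
5-cycle multiply and 20 with the 10-cycle multiply, while the ordering
scheduled for the 10-cycle case needs 11 on both.  The same instruction
list is correct on either machine without any scoreboard; only the number
of stall cycles changes.]
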
It is CRAZY to build a machine with multiple-overlapped-functional-units,
and NOT do pipeline-scheduling compilers [which, after all, were present
in early CDC compilers, for example] whether one uses scoreboarding
(to permit out-of-order initiation) or interlocking (that expects in-order
initiation, but freezes the instruction-issue unit (not necessarily the
other functional units) until the needed result is ready).

The amount of code in MIPS compilers to put a nop after a load if no other
instruction can be found is trivial [when I looked last, I found 3 lines
of code that were doing that] compared to the rest of the reorganizations.
People who worry about this being a cause of bugs or complexity in the
compilers (compared with everything else) should NEVER, EVER fly on a
Boeing 747: after all, going from San Francisco to London, you might
fall out of your seat and break your leg due to a defective seat-belt
buckle :-)

ASSERTION 3
	3A: supercomputers
		Probably true
	3B: current microprocessors
		Seems unlikely, until they start having memory systems like 3A.

I'm out of time & getting the SPEC bench-a-thon together is going to occupy
most of my time for the next couple weeks, so maybe somebody else can
continue this.  Specifically, maybe brooks@maddog.llnl.gov (or anybody else
really familiar with supercomputer architecture) could describe the
memory systems of such things.  There are some fairly sensible reasons
why the answers on 3A & 3B might be opposite....

Finally, maybe somebody from 88K-land could describe how far into
out-of-order execution the 88K goes, i.e., assuming no scoreboard block,
	1) how many instructions can be issued beyond a load that
	cache-misses, or tlb-misses, or both?
	2) how many instructions beyond a stalled-FP-multiply (for example)
	can you execute?
(I haven't seen anything that said definitely what the current 88K's do).

Like both 88K and MIPS, SPARC is defined to allow different-latency
FP implementations, and in fact, 3 different ones are already extant.
Perhaps the SPARC guys would care to join the fun and talk about
differences in latencies, overlap, etc.  [If you haven't noticed it,
Sun-4s recently got the FPU2 that raised FP performance in the same systems.]
-- 
-john mashey	DISCLAIMER:
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086