Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!pasteur!ames!oliveb!pyramid!prls!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Register Scoreboarding
Message-ID: <19463@winchester.mips.COM>
Date: 11 May 89 21:57:56 GMT
References: <24821@lll-winken.LLNL.GOV> <3288@orca.WV.TEK.COM>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 114

In article <3288@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner;685-2505;61-201;;frip) writes:
>On using object-code recompilers to port to new implementations of a
>non-scoreboarding architecture:
>
>The notion that the problems addressed by scoreboarding can be resolved
>at compile time is intellectually appealing, but not completely true --
>some latencies just cannot practically be predicted at compile time.
>The usual example is the added latency involved in a data cache miss.
>
>Clamping the whole CPU on cache miss isn't a technique that can survive
>into the 1990s.  I'm very curious to see what the non-scoreboard folks
>will do.

First, we need to get some definitions clear, then some facts, and then
do some quick back-of-the-envelope numbers.

1) Scoreboarding could be used, or not, on any of the currently popular
RISCs.  Scoreboarding, or the lack thereof, seldom shows up in the
architecture of these machines.  [The MIPS load-delay slot is the one
obvious exception.]

2) Most of the existing RISCs obtain at least some, and perhaps
substantial, concurrency of FP operations.  As far as I can tell, most
of them have architectures that require interlocks or scoreboarding,
as the latencies and issue rates of FP ops are permitted to move around
from implementation to implementation, and FP ops tend to want to be
multi-cycle, and tend to be more amenable to more silicon.

3) I'm not sure about some of the other RISCs, but R3000s only clamp
the main pipeline on a cache miss, not the independent units, like:
	integer mul/div; FP add; FP mul; FP div
all of which will keep cranking away if they're already started.

4) In the analysis below, it is important to note that typical speed
RISC pipelines try to have a very regular flow of data in and out of
the registers, with as few ports as they can get away with.  Note, for
example, that if you're doing out-of-order execution, and 3 operations
finish together, wanting into a 1-write-port register file, it will
take 3 cycles for them to do that...  This is one of the reasons that
most of the RISCs went with separate integer and FP regs, and maybe
even 2-write-port FP regs, even when using 1-write-port integer regs.
More generically, suppose you design a machine so that you can start
operations even while others are still pending.  You'd better make sure
that what you gain at the front-end (by not stalling) you don't mostly
lose again at the back-end (by then stalling while waiting for register
access).

5) Back-of-the-envelope: how much would an R3000 gain by continuing
execution beyond a cache miss?  (Assume an M/2000-style machine; 20%
loads, 10% stores; about 25% memory system degradation on top of
instruction cycles, on real programs.)  Most of this will concern
integer programs.

a) I-cache-miss: it seems pretty hard to get beyond an I-cache miss;
   we already do the instruction-streaming gimmick to do what's
   possible to cover the latency.

b) Load-misses (R3000s use write-thru caches, so there aren't really
   store misses).
   Using the typical rule-of-thumb:
	20% loads
	70% of 1st load-delay slot fillable* (MIPS)
	30% of 2nd load-delay slot fillable (Stanford)
	10% of 3rd load-delay slot fillable (Stanford)
   i.e., these give you an idea of how far you can get before you
   REALLY need that data, because the compilers are trying hard
   already.
   I think that what this says is that, on the average, you get
	.7 * 1 + .3 * 2 + .1 * 1 = .7 + .6 + .1 = 1.4
   instructions beyond the load before you want that data.

c) Suppose you get a 90% hit-rate on data (not unreasonable), and you
   make a bunch of favorable assumptions:
	10% misses x 20% loads = 2%: % of instructions that are
	cache-missing loads.
   Assume 1.4 instruction cycles gained by non-stalling.
   This gives 1.4 * 2% = 2.8% that might be gained.
   If you consider multi-cycle operations that could be fired up,
   then you might gain a little more.
   As a cross-check, in this machine, data-cache misses have something
   like 12 cycles of latency, then transfer 16 words of data, and NOT
   doing the scoreboarding is like adding 1.4 cycles to this.
   If you (real approximate) figure that every extra cycle of latency
   costs you 1% of performance, then this gives you 1.4 x 1% = 1.4%,
   which (for the level of enveloping) is not bad.

d) Now, why are these favorable assumptions?
   The instructions that follow the cache-missing loads might be MORE
   cache-missing loads, and even if you're willing to have multiple
   outstanding requests, you still have to get the data into the regs
   sooner or later, which probably means more write ports than you
   would have had.  [Mixing integer & FP will help.]
   I.e., it is hard to make a generic description of the added stalls
   on the back-end of the process, as it is very dependent on the
   system partitioning, read/write ports, memory interface, etc.
   If you had a lot of long-cycle-count operations, it would help you
   to start them while waiting for a cache-refill [i.e., long FP ops;
   of course, most of the frequent ones you want to be quick anyway...]

Bottom line: the part of scoreboarding that lets you continue beyond a
load-cache-miss gets you 1-2%, in an R3000-like architecture, in
back-of-the-envelope style, which of course really needs a lot of
simulation to confirm.  Maybe other people can post some real DATA
about this, as this was a quick guess for one machine.

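For anyone who wants to plug in numbers for a different machine, the
estimate in 5b/5c can be written down as a few lines of Python.  This
is only a sketch of the arithmetic above, not a simulation: the
function names are made up for illustration, the fill terms are copied
as written in 5b, the miss rate is taken as 1 minus the assumed 90%
data hit-rate, and the "1% per extra cycle of latency" figure is the
rough rule used in the cross-check.

```python
# Back-of-the-envelope sketch of items 5b/5c.
# All function and variable names are illustrative assumptions.

def instructions_beyond_load(fill_terms):
    """Sum the (fraction, weight) terms exactly as written in 5b."""
    return sum(fraction * weight for fraction, weight in fill_terms)


def gain_from_not_stalling(load_fraction, data_miss_rate, cycles_gained):
    """Fraction of instruction cycles recovered by running past a load miss.

    This is an optimistic figure: it ignores back-end stalls for
    register write ports and back-to-back missing loads (see 5d).
    """
    missing_loads = load_fraction * data_miss_rate
    return missing_loads * cycles_gained


def cross_check(cycles_gained, cost_per_cycle=0.01):
    """Rough rule from 5c: each extra cycle of miss latency costs ~1%."""
    return cycles_gained * cost_per_cycle


# Terms as written in 5b: .7 * 1 + .3 * 2 + .1 * 1
FILL_TERMS = [(0.7, 1), (0.3, 2), (0.1, 1)]

beyond = instructions_beyond_load(FILL_TERMS)
gain = gain_from_not_stalling(load_fraction=0.20,
                              data_miss_rate=0.10,   # 1 - 90% hit-rate
                              cycles_gained=beyond)

print(f"instructions beyond the load: {beyond:.1f}")
print(f"optimistic gain:              {gain:.2%}")
print(f"cross-check:                  {cross_check(beyond):.2%}")
```

Changing load_fraction, data_miss_rate, or FILL_TERMS shows how
sensitive the answer is to the assumptions; none of it accounts for
the back-end effects in 5d, so treat the result as a ceiling.
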
This is not to say scoreboarding is good or bad: it is a reasonable
way to design things.  Note, of course, that one must be extremely
careful to think thru any error-handling scenarios.  For example, if
state is actually committed beyond a load-cache-miss [rather than
register-relabeling games], what do you do if you get a transient
bus-error when the loaded-data actually arrives?  You cannot just back
up the PC and try again, unless you track any side-effects and undo
them, or skip those instructions, or something.

Of course, out-of-order initiation of FP ops usually implies the
possibility of imprecise exceptions, which is OK, but does cost you
somewhere.  Out-of-order initiation of anything implies a little
trickier exception-handling in general.
--
-john mashey	DISCLAIMER:
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086