Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!pasteur!ames!oliveb!pyramid!prls!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Register Scoreboarding
Message-ID: <19463@winchester.mips.COM>
Date: 11 May 89 21:57:56 GMT
References: <24821@lll-winken.LLNL.GOV> <GRUNWALD.89May9113443@flute.cs.uiuc.edu> <3288@orca.WV.TEK.COM>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 114

In article <3288@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner;685-2505;61-201;;frip) writes:
>On using object-code recompilers to port to new implementations of a
>non-scoreboarding architecture:
>
>The notion that the problems addressed by scoreboarding can be resolved
>at compile time is intellectually appealing, but not completely true --
>some latencies just cannot practically be predicted at compile time.
>The usual example is the added latency involved in a data cache miss.
>
>Clamping the whole CPU on cache miss isn't a technique that can survive
>into the 1990s.  I'm very curious to see what the non-scoreboard folks
>will do.

First, we need to get some definitions clear, and then some facts,
and then do some quick back-of-the-envelope numbers.

1) Any of the currently popular RISCs could be built with or without
scoreboarding. Scoreboarding, or the lack of it, seldom shows through into
the architecture of these machines. [The MIPS load-delay slot is the one
obvious exception.]

2) Most of the existing RISCs obtain at least some, and perhaps substantial,
concurrency of FP operations.  As far as I can tell, most of them have
architectures that require interlocks or scoreboarding, because the latencies
and issue rates of FP ops are permitted to vary from implementation
to implementation, and FP ops tend to want to be multi-cycle, and tend to
benefit more from added silicon.

3) I'm not sure about some of the other RISCs, but R3000s only clamp
the main pipeline on a cache miss, not the independent units, like:
	integer mul/div; FP add; FP mul; FP div
all of which will keep cranking away if they're already started.

4) In the analysis below, it is important to note that typical speed RISC
pipelines try to have a very regular flow of data in and out of the registers,
with as few ports as they can get away with.  Note for example, that if
you're doing out of order execution, and 3 operations finish together,
wanting into a 1-write-port register file, it will take 3 cycles for
them to do that...  This is one of the reasons that most of the RISCs
went with separate integer and FP regs, and maybe even 2-write-port FP regs,
even when using 1-write-port integer regs.
More generically, suppose you design a machine so that you can start operations
even while others are still pending.  You'd better make sure that what you
gain at the front end (by not stalling) you don't mostly lose again
at the back end (by stalling while waiting for register access).
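
The write-port point in 4) can be sketched with a toy model. This is only
an illustration in Python, not a model of any real pipeline: the function
names and the cycle numbers are made up, and the only rule is that a result
waits a cycle whenever all write ports are already claimed.

```python
def writeback_schedule(finish_cycles, write_ports=1):
    """Return the cycle in which each result is actually written back.

    finish_cycles: cycle in which each operation produces its result.
    At most `write_ports` results may be written per cycle; the rest wait.
    (Hypothetical toy model, oldest-first arbitration.)
    """
    used = {}        # cycle -> number of write ports already claimed
    schedule = []
    for finish in sorted(finish_cycles):
        cycle = finish
        while used.get(cycle, 0) >= write_ports:
            cycle += 1               # all ports busy: result waits a cycle
        used[cycle] = used.get(cycle, 0) + 1
        schedule.append(cycle)
    return schedule


def cycles_to_retire(finish_cycles, write_ports=1):
    """Cycles spent writing back, counted from the earliest finish."""
    schedule = writeback_schedule(finish_cycles, write_ports)
    return max(schedule) - min(finish_cycles) + 1


# Three operations finishing in the same (arbitrary) cycle 10.
print(cycles_to_retire([10, 10, 10], write_ports=1))  # 3
print(cycles_to_retire([10, 10, 10], write_ports=2))  # 2
```

With one port the three results take 3 cycles to get into the registers;
a second port cuts that to 2, which is the kind of back-end cost that has
to be weighed against whatever was gained by not stalling at the front.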

5) Back-of-the-envelope: how much would an R3000 gain by continuing execution
beyond a cache miss? (Assume M/2000-style machine; 20% loads, 10% stores;
about 25% memory system degradation on top of instruction cycles, on
real programs.)  Most of this will concern integer programs.
	a) I-cache-miss: it seems pretty hard to get beyond an I-cache miss;
	we already do the instruction-streaming gimmick to do what's possible
	to cover the latency.
	b) Load-misses (R3000s use write-thru caches, so there aren't really
		store misses).
	using the typical rule-of-thumb:
		20% loads
			70% of 1st load-delay slot fillable* (MIPS)
			30% of 2nd load-delay slot fillable (Stanford)
			10% of 3rd load-delay slot fillable (Stanford)
			i.e., these give you an idea of how far you can get
			before you REALLY need that data, because the
			compilers are trying hard already.
		I think what this says is that, on average,
		you get .7 * 1 + .3 * 2 + .1 * 1 = .7 + .6 + .1 = 1.4
		instructions beyond the load before you want that data.
	c) Suppose you get 90% hit-rate on data (not unreasonable),
	and you make a bunch of favorable assumptions:
	10% misses x 20% loads = 2% : % of instructions that are cache-missing loads.
	Assume 1.4 instruction cycles gained by non-stalling.  This gives
	1.4 * 2% = 2.8% that might be gained.  If you consider multi-cycle
	operations that could be fired up, then you might gain a little more.
	As a cross-check, in this machine, data-cache misses have something
	like 12 cycles of latency, then transfer 16 words of data, and NOT
	doing the scoreboarding is like adding 1.4 cycles to this.  If you
	figure (very approximately) that every extra cycle of latency costs
	you 1% of performance, then this gives you 1.4 x 1% = 1.4%,
	which, for this level of enveloping, is not a bad match.
	d) Now, why are these favorable assumptions?
		The instructions that follow the cache-missing loads might
		be MORE cache-missing loads, and even if you're willing to have
		multiple outstanding requests, you still have to get the data
		into the regs sooner or later, which probably means more write
		ports than you would have had. [mixing integer & FP will help]
		I.e., it is hard to make a generic description of the added
		stalls on the back-end of the process, as it is very dependent
		on the system partitioning, read/write ports, memory
		interface, etc.  If you had a lot of long-cycle-count
		operations, it would help you to start them while waiting
		for a cache-refill [i.e., long FP ops; of course, most of
		the frequent ones you want to be quick anyway...]
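
The envelope arithmetic in 5) is just a product of three factors, so it is
easy to replay with different guesses. A minimal sketch in Python; the
function names are made up for illustration, and the weights are copied
from the sum in (b) as written. Note that a 90% hit rate means a 10% miss
rate, so cache-missing loads come to 2% of instructions and the product is
about 2.8%.

```python
def average_slack(terms):
    """Sum of (fraction, instructions) pairs, as in the sum in (b)."""
    return sum(fraction * insns for fraction, insns in terms)


def gain_from_not_stalling(load_fraction, hit_rate, slack_cycles):
    """Cycles per instruction saved by running on past a load miss."""
    missing_loads = load_fraction * (1.0 - hit_rate)  # fraction of instructions
    return missing_loads * slack_cycles


# Figures taken from the post; everything else here is illustrative.
slack = average_slack([(0.7, 1), (0.3, 2), (0.1, 1)])
gain = gain_from_not_stalling(load_fraction=0.20, hit_rate=0.90,
                              slack_cycles=slack)

# Cross-check: rule of thumb of ~1% performance per extra cycle of latency.
cross_check = slack * 0.01

print(f"slack       = {slack:.1f} instructions")
print(f"gain        = {gain * 100:.1f}% of instruction cycles")
print(f"cross-check = {cross_check * 100:.1f}%")
```

Plugging in a different hit rate or load fraction shows how quickly the
possible gain shrinks or grows; none of this replaces real simulation.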

Bottom line: the part of scoreboarding that lets you continue beyond a
	load-cache-miss gets you 1-2%, in an R3000-like architecture,
	in back-of-the-envelope style, which of course really needs a
	lot of simulation to confirm.   Maybe other people can post some
	real DATA about this, as this was a quick guess for one machine.
	This is not to say scoreboarding is good or bad: it is a reasonable
	way to design things.  Note, of course, that one must be extremely
	careful to think thru any error-handling scenarios.  For example,
	if state is actually committed beyond a load-cache-miss [rather than
	register-relabeling games], what do you do if you get a transient
	bus-error when the loaded-data actually arrives?  You cannot just
	back up the PC and try again, unless you track any side-effects and
	undo them, or skip those instructions, or something.
	Of course, out-of-order initiation of FP ops usually implies the
	possibility of imprecise exceptions, which is OK, but does cost
	you somewhere. Out-of-order initiation of anything implies a little
	trickier exception-handling in general.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086