Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!decwrl!sun-barr!apple!voder!pyramid!prls!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Register Scoreboarding
Message-ID: <19661@winchester.mips.COM>
Date: 13 May 89 19:18:08 GMT
References: <24821@lll-winken.LLNL.GOV> <GRUNWALD.89May9113443@flute.cs.uiuc.edu> <3288@orca.WV.TEK.COM> <19463@winchester.mips.COM> <170@dg.dg.com>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 135

In article <170@dg.dg.com> mpogue@dg.UUCP (Mike Pogue) writes:
>In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>
>>Bottom line: the part of scoreboarding that lets you continue beyond a
>>	load-cache-miss gets you 1-2%, in an R3000-like architecture,
>
>  John, 
>	I think you have missed the point here.  The performance improvement
>due to register scoreboarding is only of minor interest.  The real point
>from an architectural point of view is that all binaries continue to run
>in a predictable way, even when implementation details change.

I MUST NOT BE COMMUNICATING RIGHT, OR ELSE PEOPLE AREN'T HEARING
WHAT I'M SAYING, OR THE NET IS LOSING POSTINGS:
My posting: <19162@winchester.mips.COM> included:
-In particular, Motorola & co are persistent in claiming that the world
-will fall apart for MIPS if the timings of the floating-point operations
-change, despite the fact that it has clearly been stated many times
-that we have complete interlocking on ALL of the multi-cycle operations.
-Really, the only things that don't have interlocking are loads and
-equivalents (i.e., move-from-coprocessor), and they all have a 1-cycle
-delay that is predictable to the compilers.  The (Without) in
-Microprocessor (Without) Interlocking Pipeline Stages, which may have
-been appropriate for the Stanford MIPS, is pretty much irrelevant
-when it comes to MIPSco MIPS.
-As I've said here before, if we ended up with loads that had another
-cycle of latency, we'd build a machine with an interlock on the extra
-cycle.  If we decided to put in load interlocks, that would
-be upward-compatible, although we'd likely compile 3rd-party executables
-with R3000-style forever. (Of course, if we did add load interlocks at
-some point, and if there got to be more of those machines around, at some
-point maybe we'd start advising people to compile for that, and then do
-a reverse-translate on R3000-machines!)
-If the timings of floating-point operations
-are different (and they are) in forthcoming products, the existing object
-code works fine.  However, even with completely interlocked and/or
-scoreboarded code, you STILL want the compilers to be as aggressive
-as possible.  Fortunately, the way most of these things work, if you
-try to optimize for the version with the longest latencies, it usually
-works pretty well for ones with shorter latencies as well.  To see this,
-suppose you had a 5-cycle FP multiply, and so you'd been generating code
-that tried to issue 4 more instructions before using the result of the
-multiply.  IF the multiply expanded to 10 cycles, the compiler folks
-would try to work harder and find more things to do while the multiply
-were running, which wouldn't usually hurt the machine with the 5-cycle
-multiply.  It's just a question of the number of stall cycles, and
-it's obvious that it almost always pays to spread the computation of a
-multi-cycle result, and the use of that result as far apart as possible.
-
-This, of course, is not remotely a new issue: any of the long-lived
-computer product lines has faced this, especially those that
-cover a range of implementation technologies, such as VAXen or S/360s.
-The solutions are the same, except that the simplicity of RISC-style
-instructions makes it marginally easier to manipulate object code.
-Our experience with these methods tends to make us more willing to
-consider object code translation as one more trick to use when it makes
-sense, and it's really not that weird once you get used to it.
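
To put rough numbers on the 5-cycle vs. 10-cycle multiply example in the
quoted posting, here is a minimal sketch in Python.  The function name and
the model (one instruction issued per cycle, a full interlock on the
consumer) are my assumptions for illustration, not a description of any
real pipeline:

```python
def stall_cycles(latency, independent_instrs):
    """Stall cycles seen by the first consumer of a multi-cycle result.

    Assumed model: the producer issues at cycle 0, its result is usable at
    cycle `latency`, and `independent_instrs` unrelated instructions issue
    (one per cycle) before the consumer is reached.
    """
    return max(0, latency - 1 - independent_instrs)


# Code scheduled for the 5-cycle multiply has 4 fillers; code scheduled
# for the 10-cycle multiply has 9.  Run each schedule on each machine.
for fillers in (4, 9):
    for latency in (5, 10):
        stalls = stall_cycles(latency, fillers)
        print(f"fillers={fillers} latency={latency} stalls={stalls}")
```

In this model the schedule built for the longer latency costs nothing on
the shorter-latency machine, while the reverse case pays in stall cycles
only; the results are the same either way.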

So now, I'll try again.  Here are some assertions that have appeared:

ASSERTION 1: scoreboarding lets you modify latencies for
different operations while using the same object code.

ASSERTION 2: scoreboarding is the ONLY way to do this; if a
fully-interlocked machine doesn't use scoreboarding, somehow, it is deemed
impossible for the object of ASSERTION 1 to be accomplished.

ASSERTION 3: well, if interlocking works after all, then scoreboarding
is better for performance reasons.
	3A: in supercomputers
	3B: in current microprocessors
------
ASSERTION 1 is clearly true.
ASSERTION 2 is silly; there are too many machines implemented without
scoreboards, but with interlocks, that let you modify the latencies.  I'll say,
one more time, there is exactly one kind of user-visible latency in a MIPS R3000
that is not interlocked, and that's the {load, move-from-coprocessor},
and it's one cycle that needs to be covered by the compiler.  The
likely-to-be-variable-and/or-long latencies [FP, int mul/div] are
all fully-interlocked, even though the compilers work hard to do their
best to schedule the pipelines well.  If we did an implementation where
simulations claimed that out-of-order instruction initiation was a
sensible overall design choice, then we might use scoreboarding as an
implementation choice.  My back-of-the-envelope analysis showed why we
weren't yet overly excited about that.
It is CRAZY to build a machine with multiple-overlapped-functional-units,
and NOT do pipeline-scheduling compilers [which, after all, were present
in early CDC compilers, for example] whether one uses scoreboarding
(to permit out-of-order initiation) or interlocking (that expects
in-order initiation, but freezes the instruction-issue unit (not necessarily the
other functional units) until the needed result is ready).
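
The interlocking case can be sketched as a toy simulation in Python.  This
is only an illustration under assumptions of mine: the function name, the
one-issue-per-cycle model, and the latency numbers are all made up, and it
tracks only read-after-write dependences:

```python
def run_interlocked(program, latency):
    """Issue `program` in order on a fully interlocked machine.

    program: list of (op, dest, sources) tuples.
    latency: dict mapping op -> cycles until its result is usable.
    Returns the cycle count needed to issue every instruction.
    """
    ready_at = {}  # register -> first cycle at which its value can be read
    cycle = 0
    for op, dest, srcs in program:
        # Interlock: freeze issue until every source operand is ready.
        cycle = max([cycle] + [ready_at.get(r, 0) for r in srcs])
        ready_at[dest] = cycle + latency[op]
        cycle += 1  # the next instruction issues no earlier than next cycle
    return cycle


# The same "object code" run against two hypothetical latency tables.
prog = [("mul.d", "f2", ("f4", "f6")),
        ("add.d", "f8", ("f10", "f12")),
        ("add.d", "f14", ("f2", "f8"))]
fast = {"mul.d": 5, "add.d": 2}
slow = {"mul.d": 10, "add.d": 2}
print(run_interlocked(prog, fast), run_interlocked(prog, slow))
```

The program is untouched between the two runs; only the number of cycles
spent waiting on the multiply changes.
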
The amount of code in MIPS compilers to put a nop after a load if
no other instruction can be found is trivial [when I looked last,
I found 3 lines of code that were doing that] compared to the rest of
the reorganizations.  People who worry about this being a cause of bugs
or complexity in the compilers (compared with everything else)
should NEVER, EVER fly on a Boeing 747: after all, going from San Francisco
to London, you might fall out of your seat and break your leg due to a
defective seat-belt buckle :-)
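
To give a feel for how small the nop-after-load fallback is, here is a
hedged sketch in Python.  It is NOT the code from the MIPS compilers; the
tuple format, the function name, and the set of load-like opcodes are
assumptions for illustration, and it shows only the fallback, not the
reorganizer's search for a useful instruction to put in the slot:

```python
LOAD_LIKE = {"lw", "mfc1"}  # loads and move-from-coprocessor (assumed set)


def fill_load_delay(instrs):
    """Insert a nop after a load whose result the next instruction reads.

    instrs: list of (op, dest, sources) tuples, already reordered by
    whatever scheduling ran earlier.  Returns a new list.
    """
    out = []
    for i, (op, dest, srcs) in enumerate(instrs):
        out.append((op, dest, srcs))
        if op in LOAD_LIKE and i + 1 < len(instrs):
            next_srcs = instrs[i + 1][2]
            if dest in next_srcs:
                # Nothing useful was found for the delay slot: pad it.
                out.append(("nop", None, ()))
    return out


prog = [("lw", "r2", ("r4",)),
        ("addu", "r3", ("r2", "r5")),   # reads r2 right after the load
        ("lw", "r6", ("r4",)),
        ("addu", "r7", ("r3", "r5")),   # independent of r6: covers the slot
        ("addu", "r8", ("r6", "r7"))]
print([op for op, _, _ in fill_load_delay(prog)])
```

Only the first load needs padding here; the second one already has an
independent instruction in its delay slot.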

ASSERTION 3
	3A: supercomputers
		Probably true
	3B: current microprocessors
		Seems unlikely, until they start having memory systems
		like 3A.

I'm out of time & getting the SPEC bench-a-thon together is going to
occupy most of my time for the next couple weeks, so maybe somebody
else can continue this.  Specifically, maybe brooks@maddog.llnl.gov
(or anybody else really familiar with supercomputer architecture)
could describe the memory systems of such things.  There are some
fairly sensible reasons why the answers on 3A & 3B might be
opposite....

Finally, maybe somebody from 88K-land could describe how far into
out-of-order execution the 88K goes, i.e., assuming no scoreboard
block,
	1) how many instructions can be issued beyond a load that
	cache-misses, or tlb-misses, or both?
	2) how many instructions beyond a stalled-FP-multiply
	(for example) can you execute?
(I haven't seen anything that said definitely what the current 88K's do).

Like both 88K and MIPS, SPARC is defined to allow different-latency
FP implementations, and in fact, 3 different ones are already extant.
Perhaps the SPARC guys would care to join the fun and talk about
differences in latencies, overlap, etc.  [If you haven't noticed it,
Sun-4s recently got the FPU2 that raised FP performance in the same
systems.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086