Path: utzoo!attcan!uunet!husc6!mailrus!ames!oliveb!pyramid!prls!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Message-ID: <2438@winchester.mips.COM>
Date: 19 Jun 88 05:43:10 GMT
References: <1082@nud.UUCP>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 187

In article <1082@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
>    On RISC processors without a scoreboard*, how are the results of memory
>references guaranteed to be available before they are used in subsequent
>computations?....

This was discussed a while back.  Sounds like it's time to do it again.
NOTE: caveats: there are certainly architectures for which some ofthe
following doesn't fit.  It is intended to be mostly applicable to
"mainline" RISC micros.

>2)   If the compiler/assembly writer is required to wait a certain number
>of ticks after a load before using the results of the load, how is the
>required amount of delay determined?   Can you assume you always get a
>cache hit (I doubt it)?....

Following are:
1) How the MIPS R2000/R3000 does it.
2) Some thoughts on when scoreboarding does/doesn't make sense, and why.

HOW THE R2000/R3000 WORK

1) A load requires one cycle to initiate, plus one "architectural"
cycle of latency, which must be covered by the software,
i.e., it must insert a nop if it can't find anything useful.

2) If we ever built a machine in which the primary cache had 2 or more
cycles of latency, the software would be responsible for the 1st cycle of
latency, and the hardware responsible for cycles 2,...n. (This preserves the
ability to run the existing object code unchanged.)  However, given the
domain of problems we worry about, this is probably irrelevant, as we'll
probably never build a system whose first-level cache has more than 1 cycle of
latency, given the performance impacts.  Of course, it's hard to build
a machine that has a zero-cycle load latency :-)  I.e., there's a
reason for having exactly 1 cycle of latency, not 2 or 0.

3) A lot of statistics say:
	a) Reorganizers get to fill 70-80% of load delays in 1-cycle
	latency machines.
	b) Roerganizers get to fill (as I recall, it's been awhile)
	30% of delays in 2-cycle latency machines.
	c) After that, it gets bad (10% or less).  [What does this mean?
	ans: if you load something, you expect to use it pretty soon.]

4) All of that assumes a cache-hit (which is, of course, the most
frequent case).  If there is a cache-miss, the main pipeline freezes
while the cache refill is done:
	a) One or more words of data are brought into the D-cache.
	b) The stalled instruction resumes after the fixup.
Thus, the software could care less how far away memory is, whether or
not DRAM refresh is running, etc.

5) There are split I & D-caches, and a 1-cycle latency branch is used,
with software also being required to fill the branch delay slot.
Approximately similar numbers (1 slot rather than 0 or 2) apply as
in the load case.

6) As it happens, when I said the main pipeline freezes, what I glossed over
was that there are numerous independent functional units:
	a) The main integer unit (i.e., that really knows where the PC is),
	including all loads/stores.
	b) Integer multiplier/divider
	c) FP adder
	d) FP multiplier
	e) FP divider
The results of b) - e) are interlocked, i.e., if you attempt to use the
result of b) - e), and it isn't ready, it is interlocked by the hardware,
rather than software.  You can consider this "partial scoreboarding",
although we don't particularly call it that, as the main pipeline does
completely freeze, rather than trying to issue the following instruction.

7) Now, why should it be that we interlock some things, and not others?

a) Note that the 2 things that are NOT interlocked (load and branch):
	Have small (1-cycle delays) that are unlikely to be shortened,
		and are undesirable to lengthen
	Have about a 70% chance of having a 1-cycle delay slot filled
	by software.

b) Note that the things that ARE interlocked:
	Are multiple cycle operations (DP ADD = 2, rest are longer).
	Are operations whose performance might well change in a future
	version (i.e., you might throw more silicon at DP MULT to improve
	it from 5 cycles, etc)
	Are generally difficult for compilers to handle completely, for
	both ofthe above reasons.

SCOREBOARDING

1) If you look at where scoreboarding came from (I remember it first from
CDC 6600 (see Bell & Newell book, for example), but there may be earlier ones),
you had machines with:
	a) Many independent functional units
	b) Many multiple-cycle operations (integer add was 3 cycles in 6600,
	for example, and the FP operations were of course longer).
	c) Multiple memory pipes, with a long latency to memory, and no caches.

2) Note that RISC micros probably have a), but normally, only the FP ops,
and maybe a few others have long multi-cycle operations, and they sure don't
have multiple memory pipes, and they often have caches.

3) Whether or not a RISC machine has no scoreboarding, some, or complete,
it had better be designed with a pipeline reorganizer in mind, because it's
absolutely silly not to be using one.  With one exception (in our case),
a scoreboard can NEVER do any better than a reorganizer.  I'll show the
exception later, but here is an example why I say this.  Consider:
	a = b + c; d = 0;
1	lw	r1,b
2	lw	r2,c
3	add	r3,r2,r1
4	sw	r3,a
5	sw	zero,d

On a machine with a simple load-interlock, and 1-cycle-latency loads,
the machine would stall for 1 cycle before instruction 3.
Suppose you do more complex scoreboarding, i.e., you continue attempting
to issue instructions?  Then, you might do 1, 2, try 3 and save it,
and then either come back and do 3 again (probably) or go on to 4 and
discover that it stalls also.  If one looks at integer code sequences,
one discovers that it is hard to discover many things that don't quickly
depend on something else (barring, of course, Multiflow-style compiler
technology and hardware designs not currently feasible in VLSI micros).

Now, the point is that no simple scoreboarding hardware is going to be able
to figure out that the best sequence is:
	a = b + c; d = 0;
1	lw	r1,b
2	lw	r2,c
5	sw	zero,d
3	add	r3,r2,r1
4	sw	r3,a

which ALWAYS wins (given the memory system described).  Note that
a 2-cycle latency adds 1 cycle more of delay, even with the reorganized
code.  Also, note that if you're willing to add mucho more
hardware, you can lookahead further (really, keeping 1 independent PC or
pipeline for each extra instruction, but compilers can always lookahead
further).  Similar things go on for floating-point, i.e., no matter
how much scoreboarding you do, the compiler MUST do some code-spreading
or you will lose performance.

Note also that streamlined RISC micros usually do 2 register reads and
1 write/cycle, it's hard to get anything else sque

As one last example, one of the MIPers here once wrote a reorganizer for
a machine architecture that was fully-interlocked, and improved the
performance at least 5% on some benchmarks.

SUMMARY OF THIS PART: any pipelined machine ought to be done expecting to have
a good reorganizer, even if fully scoreboarded.

The exception I mentioned:
	function(x)	=>
	
	lw	r2,x
	jal	function
	nop

We can't move the lw to cover the jal-delay slot, because we're not sure
the function's first instruction doesn't use it.  It turns out that the
number of jal-delay slots left unfilled is fairly low, and the fraction
that might have been filled by a load is even lower, as there is almost
always some argument-gathering that can be moved into the delay slot.
(Most of the jal-delay slots come from code like this:
	if (something)
		func1();
	else
		func2();
of course, one can turn on a linker-optimizer that looks at the targets
of jal's and copies the target into the delay slot if it can.

Anyway, I think the bottom-line is that full scoreboarding:

	Is not very relevant to the integer part of the CPU.

	Might be useful for long multi-cycle operations (such as FP),
	maybe (although Earl Killian has a long analysis that
	argues otherwise that I don't have time to condense & post;
	maybe Earl will),

	Might be helpful for compilers that can't do code scheduling,
	even though it is silly to do high-performance pipelined RISCs
	without having a good reorganizer.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086