Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!csd4.milw.wisc.edu!uxc!uxc.cso.uiuc.edu!mcdurb!aglew
From: aglew@mcdurb.Urbana.Gould.COM
Newsgroups: comp.arch
Subject: Re: Register Scoreboarding
Message-ID: <28200312@mcdurb>
Date: 13 May 89 15:13:00 GMT
References: <24821@lll-winken.LLNL.GOV>
Lines: 114
Nf-ID: #R:lll-winken.LLNL.GOV:24821:mcdurb:28200312:000:5603
Nf-From: mcdurb.Urbana.Gould.COM!aglew    May 13 10:13:00 1989


>i'm curious too, but about scoreboarding. 
>will somebody please point me at some article that shows what kind of
>logic implements scoreboarding?  i have questoins like

Mark Johnson pointed out an Intel ISSCC (I always get those initials wrong)
paper on circuit implementations of scoreboarding for the '960.

Here are a few references at a slightly higher level of abstraction:

Keller, "Look-Ahead Processors", Computing Surveys Vol 7 No 4 December 1975,
	pp 177-195 (good bibliography).
Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units",
	IBM Journal Resarch and DEvelopment, 11, 1 (Jan 1967) pp 25-33
Hwu, "Exploiting Instruction Concurrency to Acheive High Performance
	in a Single-chip Microprocessor", PhD thesis, Report No. UCB/CSD 
	88/398, University of California, Berkeley, January 1988
	(sinilar papers in COMP ARCH 13 and IEEE Trans Comp)

The above references are for varieties of "Tomasulo" scheduling, which is a bit
more extreme than scoreboarding in, say, CDC or Cray processors, or the 88K.

I can't find any "regular" scoreboarding references at hand, but
Sieworik, Bell, and Newell, _Computer Structures, Principles and Examples_
contains papers on the CDC 6600 and CRAY-1.

The distinction between "regular" scoreboarding and "Tomasulo" scoreboarding
is important; roughly described, dumb interlocking halts the entire processor
if a later stage's result isn't ready (at some fixed, designated time)
even if the next instruction doesn't need the output;
regular scoreboarding only halts an instruction if its operands aren't
ready, but doesn't necessarily provide execute past; and Tomasulo scheduling
allows execution of subsequent instructions.
    That's really rough; and I have also learned that "Tomasulo" scheduling
means different things to different people. IE. to some people it means 
Tomasulo's implementation with reservation stations and a common data bus,
but to me it means, basically, a limited form of dataflow computation
in a small window.
    Plus, of course, you don't need to implement a scheduling technique
everywhere. I would call the AMD29000's load scheduling a limited form
of scoreboarding, but in the rest of the pipeline they don't bother.
    And, of course, nobody's pure - below, when I describe "regular"
scoreboarding, I do not mean the scoreboarding on the 88K. The 88K is a bit
more aggressive than that, but not quite as aggressive as Tomasulo.


>*	can existing scoreboards handle chained dependencies?
Tomasulo: yes, limited only by resource constraints (the number of reservation
	stations)
Regular scoreboarding: exactly what do you mean? A->B->C type dependencies?
        Tomasulo would dispatch all instructions to wait for their result;
        regular scoreboarding would halt at the second until the first is
	ready, then halt at the third until the second is ready.
    Or do you mean vector chaining? See how Cray does things. The only reason
        for not applying scoreboarding to vectors is that there are a *lot*
	of registers => a lot of scheduling hardware. But, things like
	N. Jouppi's pseudo-vector operations for TITAN can help.


>*	does an instruction (like R1 OP R2, where R1 ain't ready yet, but where
>	the next instruction could use the same alu and is ready-to-go) 
>	allocate an alu?  or do ya queue the decoded instruction and wait
>	for all of {r1 OP R2} to be available simultaneously?

In Tomasulo scheduling, you queue the instruction in a reservation station.
Typically, you also snarf the inputs as they become ready, so as not to
prevent other instructions that may write to those from dispatching.
(I use a notation like O->I dependencies satisfied, with execute past;
O->O dependencies satisfied, in the extreme with execute past, in less
extreme cases without)
	Of course, if ALUs are cheap enough then your ALU input latches 
become your reservation station. That's OK for integer units, but floating
point units and multipliers still are complicated enough to warrant multiple
reservation stations.
	BTW, reservation stations do not necessarily imply the serial
dependencies of queues.

>*	an assignment to R[12] must therefore be held off till all of the
>	R[12]s contents are either in the alu or in yet
>*	a bad scenario:
>	Rg = some constant
>	[lots of time]
>	mem = R1	(perhaps a "many"-cycle thing)
>	r2 = R1 op Rg
>	r3 = R2 op Rg	(perhaps this is "too hairy")
>	r4 = R3 op Rg
>	r5 = R4 op Rg
>	r6 = R5 op Rg
>	r7 = R6 op Rg	(perhaps generate a "hostile user" trap :-)
>	so seems like if you had finite (not a dataflow machine) scoreboarding 
>	you'd need to teach your compiler how to not exceed it, or maybe
>	you could quit issuing instructions if things got "too hairy", and
>	expect that that'd be infrequent.

You can't dispatch a instruction unless it has a reservation station ready
-- that's your throttling.  It doesn't happen too often (see Hwu). 

There is still benefit from teaching your compiler about your machine.

I think of Tomasulo sheduling as the route to realistic dataflow machines.
The bottleneck in dataflow machines has always been the routing of results
to inputs; with Tomasulo we are learning how to do that easily and efficiently
in VLSI, and then, hopefully, it will scale (note that I do not consider
the common data bus a defining feature).

For this I had better sign myself (I better get used to being a student 
again)

Andy Glew  
IMPACT - Illinois Microprocessor Project using Advanced Compiler Technology
"The Hwuligans"
aglew@urbana.mcd.mot.com

Disclaimer: neither my advisor nor my employer share the opinions above.