Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!csd4.milw.wisc.edu!uxc!uxc.cso.uiuc.edu!mcdurb!aglew From: aglew@mcdurb.Urbana.Gould.COM Newsgroups: comp.arch Subject: Re: Register Scoreboarding Message-ID: <28200312@mcdurb> Date: 13 May 89 15:13:00 GMT References: <24821@lll-winken.LLNL.GOV> Lines: 114 Nf-ID: #R:lll-winken.LLNL.GOV:24821:mcdurb:28200312:000:5603 Nf-From: mcdurb.Urbana.Gould.COM!aglew May 13 10:13:00 1989 >i'm curious too, but about scoreboarding. >will somebody please point me at some article that shows what kind of >logic implements scoreboarding? i have questoins like Mark Johnson pointed out an Intel ISSCC (I always get those initials wrong) paper on circuit implementations of scoreboarding for the '960. Here are a few references at a slightly higher level of abstraction: Keller, "Look-Ahead Processors", Computing Surveys Vol 7 No 4 December 1975, pp 177-195 (good bibliography). Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units", IBM Journal Resarch and DEvelopment, 11, 1 (Jan 1967) pp 25-33 Hwu, "Exploiting Instruction Concurrency to Acheive High Performance in a Single-chip Microprocessor", PhD thesis, Report No. UCB/CSD 88/398, University of California, Berkeley, January 1988 (sinilar papers in COMP ARCH 13 and IEEE Trans Comp) The above references are for varieties of "Tomasulo" scheduling, which is a bit more extreme than scoreboarding in, say, CDC or Cray processors, or the 88K. I can't find any "regular" scoreboarding references at hand, but Sieworik, Bell, and Newell, _Computer Structures, Principles and Examples_ contains papers on the CDC 6600 and CRAY-1. The distinction between "regular" scoreboarding and "Tomasulo" scoreboarding is important; roughly described, dumb interlocking halts the entire processor if a later stage's result isn't ready (at some fixed, designated time) even if the next instruction doesn't need the output; regular scoreboarding only halts an instruction if its operands aren't ready, but doesn't necessarily provide execute past; and Tomasulo scheduling allows execution of subsequent instructions. That's really rough; and I have also learned that "Tomasulo" scheduling means different things to different people. IE. to some people it means Tomasulo's implementation with reservation stations and a common data bus, but to me it means, basically, a limited form of dataflow computation in a small window. Plus, of course, you don't need to implement a scheduling technique everywhere. I would call the AMD29000's load scheduling a limited form of scoreboarding, but in the rest of the pipeline they don't bother. And, of course, nobody's pure - below, when I describe "regular" scoreboarding, I do not mean the scoreboarding on the 88K. The 88K is a bit more aggressive than that, but not quite as aggressive as Tomasulo. >* can existing scoreboards handle chained dependencies? Tomasulo: yes, limited only by resource constraints (the number of reservation stations) Regular scoreboarding: exactly what do you mean? A->B->C type dependencies? Tomasulo would dispatch all instructions to wait for their result; regular scoreboarding would halt at the second until the first is ready, then halt at the third until the second is ready. Or do you mean vector chaining? See how Cray does things. The only reason for not applying scoreboarding to vectors is that there are a *lot* of registers => a lot of scheduling hardware. But, things like N. Jouppi's pseudo-vector operations for TITAN can help. >* does an instruction (like R1 OP R2, where R1 ain't ready yet, but where > the next instruction could use the same alu and is ready-to-go) > allocate an alu? or do ya queue the decoded instruction and wait > for all of {r1 OP R2} to be available simultaneously? In Tomasulo scheduling, you queue the instruction in a reservation station. Typically, you also snarf the inputs as they become ready, so as not to prevent other instructions that may write to those from dispatching. (I use a notation like O->I dependencies satisfied, with execute past; O->O dependencies satisfied, in the extreme with execute past, in less extreme cases without) Of course, if ALUs are cheap enough then your ALU input latches become your reservation station. That's OK for integer units, but floating point units and multipliers still are complicated enough to warrant multiple reservation stations. BTW, reservation stations do not necessarily imply the serial dependencies of queues. >* an assignment to R[12] must therefore be held off till all of the > R[12]s contents are either in the alu or in yet >* a bad scenario: > Rg = some constant > [lots of time] > mem = R1 (perhaps a "many"-cycle thing) > r2 = R1 op Rg > r3 = R2 op Rg (perhaps this is "too hairy") > r4 = R3 op Rg > r5 = R4 op Rg > r6 = R5 op Rg > r7 = R6 op Rg (perhaps generate a "hostile user" trap :-) > so seems like if you had finite (not a dataflow machine) scoreboarding > you'd need to teach your compiler how to not exceed it, or maybe > you could quit issuing instructions if things got "too hairy", and > expect that that'd be infrequent. You can't dispatch a instruction unless it has a reservation station ready -- that's your throttling. It doesn't happen too often (see Hwu). There is still benefit from teaching your compiler about your machine. I think of Tomasulo sheduling as the route to realistic dataflow machines. The bottleneck in dataflow machines has always been the routing of results to inputs; with Tomasulo we are learning how to do that easily and efficiently in VLSI, and then, hopefully, it will scale (note that I do not consider the common data bus a defining feature). For this I had better sign myself (I better get used to being a student again) Andy Glew IMPACT - Illinois Microprocessor Project using Advanced Compiler Technology "The Hwuligans" aglew@urbana.mcd.mot.com Disclaimer: neither my advisor nor my employer share the opinions above.