Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!cwjcc!gatech!ncar!boulder!unicads!les
From: les@unicads.UUCP (Les Milash)
Newsgroups: comp.arch
Subject: Re: Register Scoreboarding
Message-ID: <442@unicads.UUCP>
Date: 12 May 89 20:34:53 GMT
References: <8905120628.AA08299@decwrl.dec.com>
Reply-To: les@unicads.UUCP (Les Milash)
Organization: Unicad  Boulder, CO
Lines: 56

In article <8905120628.AA08299@decwrl.dec.com> neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes:
>In article 10198 henry@utzoo.uucp (Henry Spencer) writes:
>>>Clamping the whole CPU on cache miss [sucks]
>>
>>Probably pretty much what they do now:  let access and execution proceed
>>in parallel, assuming the data isn't needed right away.
> 
>Ok, now I'm curious. What DOES a R[23]000, an AMD29000 or Sparc really do
>if there's a data cache miss on a load but there are several instructions
>behind it ....

i'm curious too, but about scoreboarding. 
will somebody please point me at some article that shows what kind of
logic implements scoreboarding?  i have questions like
*	can existing scoreboards handle chained dependencies?
*	does an instruction (like R1 OP R2, where R1 ain't ready yet, but where
	the next instruction could use the same alu and is ready-to-go) 
	allocate an alu?  or do ya queue the decoded instruction and wait
	for all of {R1 OP R2} to be available simultaneously?
*	an assignment to R[12] must therefore be held off till all of the
	R[12]s contents are either in the alu or in yet
*	a bad scenario:
	Rg = some constant
	[lots of time]
	mem = R1	(perhaps a "many"-cycle thing)
	r2 = R1 op Rg
	r3 = R2 op Rg	(perhaps this is "too hairy")
	r4 = R3 op Rg
	r5 = R4 op Rg
	r6 = R5 op Rg
	r7 = R6 op Rg	(perhaps generate a "hostile user" trap :-)
	so seems like if you had finite (not a dataflow machine) scoreboarding 
	you'd need to teach your compiler how to not exceed it, or maybe
	you could quit issuing instructions if things got "too hairy", and
	expect that that'd be infrequent.
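
to make the "quit issuing instructions" option concrete, here is a toy sketch of the simplest thing that could be called a scoreboard: one pending mark per register, strictly in-order issue, and a stall whenever a source or destination register is still pending. everything in it is an assumption for illustration (the `run` function, the latency numbers, the choice of a long-latency load to start the chain); it is not a description of any real chip.

```python
# hypothetical toy model: one "pending" entry per register, in-order issue.
# latencies are made-up numbers, not taken from any real machine.

def run(program, latency):
    """program: list of (dest, srcs, kind); latency: kind -> cycles.
    returns (issue cycle of each instruction, cycle when all results are in)."""
    ready_at = {}        # register -> cycle at which its pending result lands
    issue_cycles = []
    cycle = 0
    for dest, srcs, kind in program:
        # stall until no source and no destination register is still pending
        regs = list(srcs) + ([dest] if dest else [])
        start = max([cycle] + [ready_at.get(r, 0) for r in regs])
        issue_cycles.append(start)
        if dest:
            ready_at[dest] = start + latency[kind]
        cycle = start + 1    # at most one instruction issued per cycle
    return issue_cycles, max([cycle] + list(ready_at.values()))

latency = {"load": 10, "alu": 1}
chain = [
    ("r1", [], "load"),            # slow load (assumed 10 cycles)
    ("r2", ["r1", "rg"], "alu"),   # chained on r1
    ("r3", ["r2", "rg"], "alu"),   # chained on r2
    ("r4", ["r3", "rg"], "alu"),   # chained on r3
    ("r9", ["rg"], "alu"),         # independent, but stuck behind the chain
]
issue_cycles, done = run(chain, latency)
print(issue_cycles, done)
```

in this model chained dependencies need no extra logic at all: each instruction just waits for its own registers. the cost is that the independent last instruction also waits, because issue stops at the first instruction that isn't ready.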

i kind of see scoreboarding as "idiotproofing" (where a MIPSCo compiler is
an example of a non-idiot), in that "machines that have to have a load
delay slot filled with an independent instruction by the compiler" can run
(wrong) programs that compute with numbers that aren't fetched yet.
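
a small sketch of what "computing with numbers that aren't fetched yet" looks like, assuming a hypothetical machine where a load's result lands one instruction late and nothing interlocks. the instruction format and the `run_no_interlock` name are invented for this example, and a real machine may leave the value seen in the delay slot undefined rather than stale.

```python
# hypothetical machine with a one-instruction load delay and no interlock.

def run_no_interlock(program, regs, mem):
    """program: list of (op, dest, a, b). returns the final register contents."""
    regs = dict(regs)
    in_flight = None                     # (dest, value) of the previous load
    for op, dest, a, b in program:
        landing, in_flight = in_flight, None
        if op == "load":
            in_flight = (dest, mem[a])   # value is not in the register yet
        elif op == "add":
            regs[dest] = regs[a] + regs[b]   # reads whatever is there right now
        # any other op (e.g. "nop") does nothing
        if landing:
            regs[landing[0]] = landing[1]    # previous load lands after this instruction
    if in_flight:
        regs[in_flight[0]] = in_flight[1]
    return regs

mem = {"x": 40}
regs = {"r1": 0, "r2": 2, "r3": 0}
bad = [("load", "r1", "x", None), ("add", "r3", "r1", "r2")]
good = [("load", "r1", "x", None), ("nop", None, None, None),
        ("add", "r3", "r1", "r2")]
print(run_no_interlock(bad, regs, mem)["r3"])    # uses the stale r1
print(run_no_interlock(good, regs, mem)["r3"])   # delay slot filled, r1 has landed
```

the "bad" program is the one a scoreboard would protect against by stalling the add; the "good" program is what a careful compiler emits so that no stall logic is needed.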

i can't myself imagine what kind of logic could be built to handle all the
pathological cases i could think up, unless you can depend on the compiler
to not generate too bad a case.  i can imagine that some finite logic
could help in the "M% of cases that get you the easiest N%", and give up
otherwise by quitting issuing more instructions.
although Mr. Mashey had an argument that you wouldn't get that much (except
of course for being able to run R2000 code on the GaAs R7000)

that kind of stuff led me to my attraction to transputers.  i think of 'em 
as the ultimate scoreboarding; but it's at the system level, not in the 
processor chip.  so i realize that doesn't help us run many mips out of slow 
dram.  but the whole program, in all its millions of bytes of glory, is one 
big dependency graph, and the processor assures you that it will be executed 
in a "correctness-preserving" order, even if multiple processors are executing 
on your program.