Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!cwjcc!gatech!ncar!boulder!unicads!les
From: les@unicads.UUCP (Les Milash)
Newsgroups: comp.arch
Subject: Re: Register Scoreboarding
Message-ID: <442@unicads.UUCP>
Date: 12 May 89 20:34:53 GMT
References: <8905120628.AA08299@decwrl.dec.com>
Reply-To: les@unicads.UUCP (Les Milash)
Organization: Unicad Boulder, CO
Lines: 56

In article <8905120628.AA08299@decwrl.dec.com> neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes:
>In article 10198 henry@utzoo.uucp (Henry Spencer) writes:
>>>Clamping the whole CPU on cache miss [sucks]
>>
>>Probably pretty much what they do now: let access and execution proceed
>>in parallel, assuming the data isn't needed right away.
>
>Ok, now I'm curious. What DOES a R[23]000, an AMD29000 or Sparc really do
>if there's a data cache miss on a load but there are several instructions
>behind it ....

i'm curious too, but about scoreboarding.  will somebody please point me
at some article that shows what kind of logic implements scoreboarding?

i have questions like
* can existing scoreboards handle chained dependencies?
* does an instruction (like R1 OP R2, where R1 ain't ready yet, but
  where the next instruction could use the same alu and is ready-to-go)
  allocate an alu?  or do ya queue the decoded instruction and wait for
  all of {r1 OP R2} to be available simultaneously?
* an assignment to R[12] must therefore be held off till all of the
  R[12]s contents are either in the alu or in yet
* a bad scenario:
        Rg = some constant
        [lots of time]
        mem = R1        (perhaps a "many"-cycle thing)
        r2 = R1 op Rg
        r3 = R2 op Rg   (perhaps this is "too hairy")
        r4 = R3 op Rg
        r5 = R4 op Rg
        r6 = R5 op Rg
        r7 = R6 op Rg   (perhaps generate a "hostile user" trap :-)

so it seems like if you had finite (not a dataflow machine) scoreboarding
you'd need to teach your compiler how to not exceed it, or maybe you
could quit issuing instructions if things got "too hairy", and expect
that that'd be infrequent.

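One way to picture the "quit issuing instructions" option is the sketch below. It is only an illustration under stated assumptions, not a description of how the R2000, 29000, SPARC or any other real chip works: a single-issue, in-order machine whose scoreboard remembers, per register, when the pending result will arrive. The function name `simulate`, the `(dest, sources, latency)` tuple format, and the latencies are all hypothetical, and the "many"-cycle instruction is modeled as a slow load into R1.

```python
def simulate(program, num_regs=32):
    """Return the cycle at which each instruction issues.

    program: list of (dest, sources, latency) tuples, registers as ints.
    Hypothetical model: in-order, one issue per cycle, stall on hazard.
    """
    ready_at = [0] * num_regs    # scoreboard: cycle when each register is valid
    cycle = 0
    issue_cycles = []
    for dest, sources, latency in program:
        # Stall until every source has arrived (read-after-write) and no
        # earlier write to dest is still in flight (write-after-write).
        waits = [ready_at[r] for r in sources] + [ready_at[dest]]
        cycle = max([cycle] + waits)
        issue_cycles.append(cycle)
        ready_at[dest] = cycle + latency   # mark dest pending until then
        cycle += 1                         # next instruction issues no earlier
    return issue_cycles


RG, R1, R2, R3, R4, R5 = 0, 1, 2, 3, 4, 5

# The chained scenario: a 10-cycle load into R1, then a dependent chain.
chain = [(R1, [], 10), (R2, [R1, RG], 1), (R3, [R2, RG], 1), (R4, [R3, RG], 1)]
print(simulate(chain))      # [0, 10, 11, 12]

# An independent instruction behind the load issues without waiting.
overlap = [(R1, [], 10), (R5, [RG], 1), (R2, [R1, RG], 1)]
print(simulate(overlap))    # [0, 1, 10]
```

In this model a chain of dependencies costs no extra scoreboard state, because only the instruction at the head of the stream is ever checked; the price is that everything behind a stalled instruction waits too, ready or not.
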
i kind of see scoreboarding as "idiotproofing" (where a MIPSCo compiler
is an example of a non-idiot), in that "machines that have to have a
load delay slot filled with an independent instruction by the compiler"
can run (wrong) programs that compute with numbers that aren't fetched
yet.

i can't myself imagine what kind of logic could be built to handle all
the pathological cases i could think up, unless you can depend on the
compiler to not generate too bad a case.  i can imagine that some finite
logic could help in the "M% of cases that get you the easiest N%", and
give up otherwise by quitting issuing more instructions, although
Mr. Mashey had an argument that you wouldn't get that much (except of
course for being able to run R2000 code on the GaAs R7000).

that kind of stuff led me to my attraction to transputers.  i think of
'em as the ultimate scoreboarding; but it's at the system level, not in
the processor chip.  so i realize that doesn't help us run many mips out
of slow dram.  but the whole program, in all its millions of bytes of
glory, is one big dependency graph, and the processor assures you that
it will be executed in a "correctness-preserving" order, even if
multiple processors are executing on your program.
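
As a loose analogy for scoreboarding at the system level, here is a minimal sketch in which two "processes" are joined by a channel and the consumer simply blocks until its input has been sent. Python threads and `queue.Queue` are stand-ins chosen for illustration; they are not transputer or occam code, and the names `run_pipeline`, `producer` and `consumer` and the `None` end marker are hypothetical.

```python
import queue
import threading


def run_pipeline(values):
    """Run a producer and a consumer joined by one channel; return the results."""
    chan = queue.Queue(maxsize=1)    # stand-in for a channel (buffers one item)
    results = []

    def producer():
        for v in values:
            chan.put(v * 2)          # blocks while the channel is full
        chan.put(None)               # hypothetical end-of-stream marker

    def consumer():
        while True:
            item = chan.get()        # blocks until something has been sent
            if item is None:
                break
            results.append(item + 1)

    # The consumer is started first on purpose: it cannot run ahead of its data.
    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pipeline([1, 2, 3]))       # [3, 5, 7]
```

Whichever thread the scheduler happens to run first, the consumer never computes with a value that has not arrived, which is the "correctness-preserving" property in miniature.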