Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sun-barr!newstop!sun!exodus!rbbb.Eng.Sun.COM!chased
From: chased@rbbb.Eng.Sun.COM (David Chase)
Newsgroups: comp.arch
Subject: Re: speculative execution
Message-ID: <1283@exodus.Eng.Sun.COM>
Date: 11 Oct 90 18:04:19 GMT
References: <3432@bnr-rsc.UUCP> <1990Oct10.170424.21489@rice.edu> <3436@bnr-rsc.UUCP>
Sender: news@exodus.Eng.Sun.COM
Organization: Sun Microsystems, Mt. View, Ca.
Lines: 56

>  = bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow)
>> = preston@titan.rice.edu (Preston Briggs)

>>Well, I wasn't going to let M1 use my fabulous scheduling ideas.
>>It had to be satisfied with hardware.  Further, M2 ought to have a
>>higher clock speed since its hardware is simpler.

>Ah, but that is not very fair, is it? If code scheduling works for both
>M1 & M2, why restrict it to M2 only?

Agreed, it is unfair to run the code scheduling only on M2.  However,
I think the claim is that scheduled code won't run any faster on M1
than unscheduled code -- that's what you added all that hardware for,
right?

You can think of M1 as a machine that looks far ahead for anything
that it can execute (and fires off stalled instructions in a dataflow
fashion), whereas M2 is a machine that will only suck up instructions
for which the operands are valid and the functional unit is idle
(i.e., the first stall stops all subsequent execution) or
(Stanford-MIPS style) doesn't even stall -- it just relies on the
compiler to insert NOPs.

"Clearly" the hardware for M2 is simpler, but it will run just as
quickly if the compiler does the lookahead and reorders the
instructions into the order that M1 would have executed them.  Run
that same code through M1, and "in theory" it executes the
instructions in the same way as it did before they were reordered --
no speedup, but lots more chip real estate, and a possibly slower
cycle time.
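
To make the M1/M2 comparison concrete, here is a toy model in Python.
Everything in it is an assumption made for illustration, not a
description of any real machine: both machines issue at most one
instruction per cycle, the functional units are fully pipelined, every
register is written once (so there are no WAR/WAW hazards), and M1 can
look at every pending instruction.  The names run_in_order,
run_dataflow and PROG are made up for this sketch.

```python
# Each instruction is (dest, sources, latency_in_cycles).
PROG = [
    ("a", ("x",), 3),       # load a
    ("b", ("a",), 1),       # b = a + 1   (must wait for the load)
    ("c", ("y",), 3),       # load c
    ("d", ("c",), 1),       # d = c + 1   (must wait for the load)
    ("e", ("b", "d"), 1),   # e = b + d
]

def run_in_order(prog):
    """M2: issue strictly in program order; the first stall blocks everything."""
    ready = {}              # register -> cycle at which its value is available
    cycle = finish = 0
    for dest, srcs, lat in prog:
        # Registers never written by the program are inputs, ready at cycle 0.
        start = max([cycle] + [ready.get(s, 0) for s in srcs])
        ready[dest] = start + lat
        finish = max(finish, start + lat)
        cycle = start + 1   # the next instruction cannot issue before this one
    return finish

def run_dataflow(prog):
    """M1: each cycle, issue the oldest pending instruction whose operands are ready."""
    produced = {dest for dest, _, _ in prog}
    ready = {}
    pending = list(range(len(prog)))
    order = []              # indices in the order M1 actually issued them
    cycle = finish = 0
    while pending:
        for i in pending:
            dest, srcs, lat = prog[i]
            if all(s not in produced or ready.get(s, float("inf")) <= cycle
                   for s in srcs):
                ready[dest] = cycle + lat
                finish = max(finish, cycle + lat)
                order.append(i)
                pending.remove(i)
                break       # one issue per cycle
        cycle += 1
    return finish, order

m1_time, m1_order = run_dataflow(PROG)
# "Compiler scheduling": emit the instructions in the order M1 issued them.
SCHEDULED = [PROG[i] for i in m1_order]
```

Under these assumptions run_in_order(PROG) is slower than
run_dataflow(PROG), run_in_order(SCHEDULED) catches up with M1, and
run_dataflow(SCHEDULED) gains nothing over run_dataflow(PROG).  That is
the "no speedup" claim above, in miniature.
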

Now, M1 does win in a couple of ways, including one very important one
that I forgot to mention:

1) M1 can look like a chip that people already use (customers are a
   minor detail).

2) M1 can take advantage of early instruction completion (i.e.,
   division, or a first-level-cache hit on load) instead of assuming
   worst case.

3) M1 can do (dynamic) prefetch prediction in some limited way, and
   thus only needs to suck up instructions from *one* potential
   successor, instead of all of them.

Counter-proposals to (3) for M2 include profiling feedback to the
compiler, and the possibility (I haven't worked out the details) of
exposing the branch history as a value for the compiler to use.  I
suspect this involves a wee bit of code expansion, or the use of
"conditional execution" bits on every instruction in the style of the
Acorn Risc Machine.  Use your history of the impending branch to
determine which instructions to execute on your way to it -- if you
guessed wrong, then you use the history bits to figure out which
fix-up code to execute (actually, you could have several different
conditional branches, themselves conditional on the history bits, so
that you'd choose the right one with respect to fixup code).

David Chase
Sun Microsystems
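
A rough sketch of the history-bit counter-proposal to (3), in Python.
The post leaves the details unworked, so everything here is an
assumption for illustration: a single software-visible history bit
holding the last outcome of one branch, two made-up successor
computations (x * 2 when taken, x + 100 when not), and the names
reference, run_with_history and INPUTS.

```python
def reference(x, taken):
    """What the branch and its successor compute with no speculation at all."""
    return x * 2 if taken else x + 100

def run_with_history(inputs):
    """inputs: (x, taken) pairs for successive executions of one branch."""
    history = False         # hypothetical software-visible history bit
    results, fixups = [], 0
    for x, taken in inputs:
        # On the way to the branch: only the instructions for the successor
        # that the history bit points at are (conditionally) executed.
        guess = history
        value = x * 2 if guess else x + 100
        # At the branch the real outcome is known.  On a wrong guess, the
        # history bit tells us which fix-up code undoes the early work.
        if guess != taken:
            value = x + 100 if guess else x * 2
            fixups += 1
        results.append(value)
        history = taken     # remember this outcome for the next time around
    return results, fixups

# A loop-like branch: taken nine times, then not taken, repeated three times.
INPUTS = [(i, i % 10 != 9) for i in range(30)]
RESULTS, FIXUPS = run_with_history(INPUTS)
```

The results match the unspeculated reference whatever the guesses
were; the only cost of a wrong guess is a trip through the fix-up
code, which in this pattern happens on entry to and exit from each run
of taken branches.
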