Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sun-barr!newstop!sun!exodus!rbbb.Eng.Sun.COM!chased
From: chased@rbbb.Eng.Sun.COM (David Chase)
Newsgroups: comp.arch
Subject: Re: speculative execution
Message-ID: <1283@exodus.Eng.Sun.COM>
Date: 11 Oct 90 18:04:19 GMT
References: <3432@bnr-rsc.UUCP> <1990Oct10.170424.21489@rice.edu> <3436@bnr-rsc.UUCP>
Sender: news@exodus.Eng.Sun.COM
Organization: Sun Microsystems, Mt. View, Ca.
Lines: 56

>  = bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow)
>> = preston@titan.rice.edu (Preston Briggs)

>>Well, I wasn't going to let M1 use my fabulous scheduling ideas.
>>It had to be satisfied with hardware.  Further, M2 ought to have a
>>higher clock speed since its hardware is simpler.

>Ah, but that is not very fair, is it? If code scheduling works for both
>M1 & M2, why restrict it to M2 only?

Agreed, it is unfair to run the code scheduling only on M2.  However,
I think the claim is that scheduled code won't run any faster on M1
than unscheduled code -- that's what you added all that hardware for,
right?  You can think of M1 as a machine that looks far ahead for
anything that it can execute (and fires off stalled instructions in a
dataflow fashion), whereas M2 is a machine that will only suck up
instructions for which the operands are valid and the functional unit
is idle (i.e., the first stall stops all subsequent execution), or
one that (Stanford-MIPS style) doesn't even stall and just relies on
the compiler to insert NOPs.

"Clearly" the hardware for M2 is simpler, but it will run just as
quickly if the compiler does the lookahead and reorders the
instructions into the order that M1 would have executed them.  Run
that same code through M1, and "in theory" it executes the instructions
in the same way as it did before they were reordered -- no speedup,
but lots more chip real estate, and a possibly slower cycle time.
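
A toy sketch of that argument, under loudly stated assumptions: single
issue, an unlimited lookahead window for M1, every register written
only once (as if already renamed), and invented latencies (3 cycles
for a load, 1 for an add).  The names Instr, run_in_order and
run_out_of_order are hypothetical, not anything from the thread.

```python
from collections import namedtuple

# dest: register written; srcs: registers read; latency: cycles until dest is usable
Instr = namedtuple("Instr", "dest srcs latency")

def run_in_order(program):
    """M2-like: issue in program order; the first unready operand stalls everything."""
    ready = {}            # register -> cycle at which its value is available
    cycle = 0
    finish = 0
    for ins in program:
        # Registers never written by the program are inputs, ready at cycle 0.
        start = max([cycle] + [ready.get(s, 0) for s in ins.srcs])
        ready[ins.dest] = start + ins.latency
        finish = max(finish, start + ins.latency)
        cycle = start + 1  # one issue slot per cycle
    return finish

def run_out_of_order(program):
    """M1-like: each cycle, issue the oldest pending instruction whose operands are ready."""
    produced = {ins.dest for ins in program}
    ready = {}
    pending = list(range(len(program)))
    order = []            # program indices in the order they actually issued
    finish = 0
    cycle = 0
    while pending:
        for i in pending:
            ins = program[i]
            if all(s not in produced or ready.get(s, float("inf")) <= cycle
                   for s in ins.srcs):
                ready[ins.dest] = cycle + ins.latency
                finish = max(finish, cycle + ins.latency)
                order.append(i)
                pending.remove(i)
                break     # single issue: at most one instruction per cycle
        cycle += 1
    return finish, order

# Unscheduled code: each load is immediately followed by its use.
PROGRAM = [
    Instr("a", ("x",), 3),      # a = load x
    Instr("b", ("a",), 1),      # b = a + 1
    Instr("c", ("y",), 3),      # c = load y
    Instr("d", ("c",), 1),      # d = c + 1
    Instr("e", ("b", "d"), 1),  # e = b + d
]

m1_cycles, m1_order = run_out_of_order(PROGRAM)
# "Compiler" scheduling: emit the instructions in the order M1 issued them.
SCHEDULED = [PROGRAM[i] for i in m1_order]

print(run_in_order(PROGRAM), m1_cycles)                               # 9 6
print(run_in_order(SCHEDULED), run_out_of_order(SCHEDULED)[0])        # 6 6
```

In this model the reordered code takes the same number of cycles on
both machines, so M1's extra hardware buys nothing once the compiler
has done the lookahead.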

Now, M1 does win in a couple of ways, including one very important one
that I forgot to mention:

1) M1 can look like a chip that people already use (customers are a
   minor detail). 

2) M1 can take advantage of early instruction completion (e.g., a
   division that finishes early, or a first-level-cache hit on a
   load) instead of assuming the worst case.

3) M1 can do (dynamic) prefetch prediction in some limited way, and
   thus only needs to suck up instructions from *one* potential
   successor, instead of all of them.
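
Point (2) can be made concrete with a rough cycle count.  This is only
a sketch for the no-interlock (compiler-inserts-NOPs) flavor of M2,
assuming a chain of dependent load/use pairs; the latencies and the
function names static_cycles and dynamic_cycles are made up for
illustration.

```python
def static_cycles(actual_latencies, worst_case):
    """No-interlock M2: the compiler pads every load/use pair for the worst case.

    Each pair costs 1 slot for the load, worst_case - 1 NOPs, and 1 slot
    for the use, whatever latency the load really had at run time.
    """
    return sum(worst_case + 1 for _ in actual_latencies)

def dynamic_cycles(actual_latencies):
    """M1: the use issues as soon as the load's result actually arrives."""
    return sum(latency + 1 for latency in actual_latencies)

# Hypothetical run: three first-level-cache hits (1 cycle) and one miss (10 cycles).
latencies = [1, 1, 10, 1]
print(static_cycles(latencies, worst_case=10))  # 44
print(dynamic_cycles(latencies))                # 17
```

An interlocked M2 would sit somewhere in between: it would not pay for
the NOPs, but its instruction order would still be fixed by whatever
latency the compiler assumed.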

Counter-proposals to (3) for M2 include profiling feedback to the
compiler, and the possibility (I haven't worked out the details) of
exposing the branch history as a value for the compiler to use.  I
suspect this involves a wee bit of code expansion, or the use of
"conditional execution" bits on every instruction in the style of the
Acorn RISC Machine.  Use your history of the impending branch to
determine which instructions to execute on your way to it -- if you
guessed wrong, then you use the history bits to figure out which
fix-up code to execute (actually, you could have several different
conditional branches, themselves conditional on the history bits, so
that you'd choose the right one with respect to fixup code).
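
One way to picture the history-bit idea is the sketch below.  It
assumes a single software-visible history bit holding the last outcome
of the branch, models "conditional execution" with ordinary ifs, and
uses invented path functions (taken_path, fallthrough_path); the
details were not worked out above, so none of this is a definitive
design.

```python
def taken_path(v):
    return v * 2          # stand-in for the work on the taken successor

def fallthrough_path(v):
    return v + 1          # stand-in for the work on the fall-through successor

def run_with_history(values, cond):
    """Execute one successor's work ahead of the branch, guided by a history bit."""
    history = False       # hypothetical exposed history: last outcome of this branch
    results = []
    fixups = 0
    for v in values:
        # On the way to the branch: only the predicted successor's
        # instructions execute (predicated on the history bit).
        spec = taken_path(v) if history else fallthrough_path(v)
        taken = cond(v)   # the branch finally resolves
        if taken != history:
            # Wrong guess: the history bit says which speculative work was
            # done, and therefore which fix-up code has to run.
            spec = taken_path(v) if taken else fallthrough_path(v)
            fixups += 1
        results.append(spec)
        history = taken
    return results, fixups

values = [1, 2, 3, 5, 6, 7, 2, 8]
results, fixups = run_with_history(values, lambda v: v > 4)
print(results, fixups)
```

The fix-up count here is just the number of times the branch changes
direction, which is also where the code expansion comes from: both
successors' work and both fix-ups have to exist in the instruction
stream.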

David Chase
Sun Microsystems