Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!apple!bbn!bbn.com!slackey
From: slackey@bbn.com (Stan Lackey)
Newsgroups: comp.arch
Subject: Re: Filling branch delay slot with test
Message-ID: <45219@bbn.COM>
Date: 5 Sep 89 17:23:25 GMT
References: <1432@atanasoff.cs.iastate.edu> <26859@winchester.mips.COM> <1437@atanasoff.cs.iastate.edu>
Sender: news@bbn.COM
Reply-To: slackey@BBN.COM (Stan Lackey)
Distribution: na
Organization: Bolt Beranek and Newman Inc., Cambridge MA
Lines: 38

In article <1437@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
>In article <26859> mash@mips.COM (John Mashey) writes:
>}In article <1432> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
>}>         AGAIN:   JSUB   FOO_RTN        ; return FOO in R0
>}>                  BEQL   AGAIN          ; try again if we
>}>                  TEST   R0             ;   get zero back
>}>     Although not without complications, it would seem an
>}>     excellent way to have a high branch delay slot fill ratio.

>}Put another way: as much as computer architects would like
>}pipestages whose results are available in advance of their execution,
>}such things are only found in science-fiction......
>     No.  What I was alluding to was "starting down both paths" of the
>     branch and then "dumping the loser".

Another way is to use branch prediction: guess the direction using
some algorithm (they range from "terrible" to "pretty good") and start
fetching instructions and operands down that path.  At least you only
prefetch one path (a single instruction cache port, a single
instruction decoder, etc.).  You need to be careful about doing things
that are hard to undo (a write to memory, say) before you know whether
the guess was right, though.
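
To make the "hard to undo" point concrete, here is a minimal sketch in
C of one way to handle it: stores issued on the predicted path are held
in a small buffer and only written to memory once the branch resolves.
This is a technique I am swapping in for illustration, not a
description of any particular machine; machine_t, speculative_store,
resolve_branch and the sizes are all invented names and numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS   16u   /* toy memory size (assumption) */
#define MAX_PENDING 4u    /* toy store-buffer depth (assumption) */

typedef struct {
    uint32_t mem[MEM_WORDS];
    struct { uint32_t addr, value; } pending[MAX_PENDING];
    size_t n_pending;
} machine_t;

/* Executed on the predicted path: record the store, do not perform it. */
static bool speculative_store(machine_t *m, uint32_t addr, uint32_t value)
{
    assert(addr < MEM_WORDS);
    if (m->n_pending == MAX_PENDING)
        return false;               /* buffer full: a real pipe would stall */
    m->pending[m->n_pending].addr = addr;
    m->pending[m->n_pending].value = value;
    m->n_pending++;
    return true;
}

/* Called once the branch direction is actually known. */
static void resolve_branch(machine_t *m, bool prediction_was_right)
{
    if (prediction_was_right) {
        /* Commit the buffered stores in program order. */
        for (size_t i = 0; i < m->n_pending; i++)
            m->mem[m->pending[i].addr] = m->pending[i].value;
    }
    /* Either way the buffer is emptied; a wrong guess just drops them. */
    m->n_pending = 0;
}
```

With this arrangement a wrong guess costs the wasted fetches but leaves
memory untouched, since nothing was written before the branch resolved.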

Other than in the Multiflow Trace, all the algorithms that I know
about have been implemented in hardware.  I wonder whether there have
been studies (or implementations?) of a more conventional architecture
in which the branch instruction itself carries information (inserted
by the compiler, possibly using runtime statistics) that tells the
hardware which way to predict the branch.
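
A rough sketch in C of what I have in mind.  The instruction encoding
is purely hypothetical (one hint bit at the top of a 32-bit branch, a
16-bit word displacement at the bottom), and set_hint, predict_taken
and predicted_next_pc are made-up names; the profile counts stand in
for the "runtime statistics".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 32-bit branch encoding, invented for illustration:
 *   bit 31     : prediction hint set by the compiler (1 = predict taken)
 *   bits 30-16 : opcode and condition (ignored here)
 *   bits 15-0  : signed displacement, in instruction words
 */
#define HINT_TAKEN_BIT 0x80000000u

/* Compiler side: pick the hint from counts gathered on a profiling run. */
static uint32_t set_hint(uint32_t insn, unsigned long taken,
                         unsigned long not_taken)
{
    if (taken > not_taken)
        return insn | HINT_TAKEN_BIT;
    return insn & ~HINT_TAKEN_BIT;
}

/* Hardware side: the fetch unit just reads the bit; no history is kept. */
static bool predict_taken(uint32_t insn)
{
    return (insn & HINT_TAKEN_BIT) != 0;
}

/* Address the fetch unit would go to next under the hinted prediction. */
static uint32_t predicted_next_pc(uint32_t pc, uint32_t insn)
{
    int32_t disp = (int32_t)(insn & 0xFFFFu);
    if (disp & 0x8000)
        disp -= 0x10000;            /* sign-extend the 16-bit field */
    assert((pc & 3u) == 0);         /* word-aligned instructions assumed */
    if (predict_taken(insn))
        return pc + 4u + (uint32_t)(disp * 4);
    return pc + 4u;
}
```

The attraction is that the hint is available the first time the branch
is fetched, which is exactly when a history-based scheme has nothing
to go on.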

Most of the hardware algorithms work well for in-loop situations,
where prediction is done either by looking at the direction of the
branch (predicting backward branches taken, to keep execution in the
loop) or by caching recently executed branches and using history
analysis.  I am wondering about cases like a long piece of code (such
as a system call) where state is tested to control flow, and very
little of it is inside a loop.  Of course, lots of machines that
depend on an instruction cache are going to perform dismally on this
type of code anyway.
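
For reference, the two hardware schemes mentioned above look roughly
like this when sketched in C.  The direction rule is the usual
"backward taken, forward not taken" guess; the history scheme here is
a table of 2-bit saturating counters, which is one common choice and
not the only one.  The table size and the names (bht_t, bht_predict,
bht_update) are my own assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Direction rule: a backward branch probably closes a loop, so predict
 * it taken; predict a forward branch not taken. */
static bool predict_by_direction(uint32_t branch_pc, uint32_t target_pc)
{
    return target_pc < branch_pc;
}

/* History rule: 2-bit saturating counters indexed by the low bits of
 * the branch address.  The table size is an arbitrary choice. */
#define BHT_ENTRIES 256u

typedef struct {
    uint8_t counter[BHT_ENTRIES];  /* 0,1 = predict not taken; 2,3 = taken */
} bht_t;

static unsigned bht_index(uint32_t branch_pc)
{
    return (branch_pc >> 2) % BHT_ENTRIES;  /* drop the byte offset bits */
}

static bool bht_predict(const bht_t *t, uint32_t branch_pc)
{
    assert(t != NULL);
    return t->counter[bht_index(branch_pc)] >= 2;
}

/* Train the counter with the actual outcome once the branch resolves. */
static void bht_update(bht_t *t, uint32_t branch_pc, bool taken)
{
    uint8_t *c = &t->counter[bht_index(branch_pc)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

Note that a counter starting at zero needs two taken outcomes before
it predicts taken.  In straight-line code where each branch is seen
once or twice, the table never gets the chance to learn anything,
which is the case I am asking about.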
-Stan