Path: utzoo!utgpu!jarvis.csri.toronto.edu!db.toronto.edu!jonah
Newsgroups: comp.arch
From: jonah@db.toronto.edu (Jeffrey Lee)
Subject: Re: delayed branch
Message-ID: <1989Aug6.164210.11976@jarvis.csri.toronto.edu>
References: <2246@taux01.UUCP> <1462@l.cc.purdue.edu> <26139@shemp.CS.UCLA.EDU> <7543@cbmvax.UUCP>
Date: 6 Aug 89 20:42:10 GMT

In article <7543@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>        Note that a) branch frequency depends a lot on the architecture, and
>also on the compiler's optimizer; b) slot-filling percentages (often taken
>as a "universal" figure) depend highly on the instruction set of the machine,
>and even the type of code being compiled.

Take a look at the article by Dan Klein about CISC/RISC compiler
construction in the proceedings of the Winter 89 USENIX (San Diego).
He surveyed a number of CISC and RISC compilers and took polls of the
usage frequency of the various instructions.  He found a surprising
degree of similarity of instruction frequency in code generated by C
compilers on various CISC/RISC platforms.  The following table is not
found in the article, but is derived from the information in it:

I/KVax move  arith log   cmp cmpbra  bra  call   (excluding noop)
 1000   402    80   12   133   157    90   126   vax/pcc
 1033   420    81   14   130   158    98   130   vax/tartan
 1158   472    96   15   166   165   103   136   vax/dec
 1262   588   156   20   135   155    83   122   mc68020/gnu
 1331   616   187   21   138   147    86   131   mc68020/sun
 1470   801   127   66*  151   148    67   164   sparc/sun
 1582   829   193   66*   72   170   117   131   88100/greenhills
 1763   969   290   37    37   206   114   153   r2000/mips

 [*: ori is often used to load the lower portion of a register.]

Most of the extra operations added in the RISC code seem to be loads
and stores (lack of memory-memory operations) and arithmetic/logical
operations (adds/shifts to make up for the reduced addressing modes).
Somewhat surprisingly, although the RISC code has more instructions
than the CISC code, it is less than a two-fold expansion in code size.
Adjusting for average instruction size and adding in the data space,
these RISC programs were only 25% to 50% larger than the VAX programs.
(The 68020 programs were often 25% smaller than the VAX code.)

In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>Studies have shown (I love that phrase) that one delay can be filled
>70% of the time, and a second 30% of the time (approx, of course).

This paper showed a measured branch and call (static) frequency ranging
from 22.6% to 27.8% for R2000/SPARC/88100 and from 27.5% to 37.4% for
MC68020/VAX.  Loads made up 22% to 31% of the RISC code.  The delay
slot filling situation is somewhat confusing since some of these
processors have interlocked loads while others do not.  Similarly, some
processors conditionally execute the branch delay slot instruction
while others always execute it.  [GOS==Grain-Of-Salt: I may be wrong in
the following.]  The MIPS/R2000 compiler used there [check the paper
for the version number--I don't have a copy handy] filled
load/branch/call [GOS] delay slots about 76% of the time leading to a
14% noop frequency.  The Sun/SPARC compiler has interlocked loads [GOS]
and filled its branch/call [GOS] delay slots about 75% of the time
leading to a 4% noop frequency.  The 88100 has interlocked loads and
conditional delay slots [GOS] leading to 0% NOPs.

Interlocked loads make it easier on the compiler writer:  It means they
don't have to worry about filling delay slots after load instructions.
Unfortunately it also means that they might not try to prefetch data
and end up with slower code as a result (with stalls instead of NOPs).
However, if they do try to prefetch data (and if the interlocks do not
lengthen the cycle time) they could be ahead: a one cycle stall should
be slightly faster than a NOP because it never can cause a cache miss.
Additionally, the lack of NOPs means that a few more ``useful''
instructions can be stored in the same sized cache.

In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>There are two problems with filling delay slots.  The first is that
>branches occur on the order of every 3 to 5 instructions - that's
>a lot of slots to fill.  ...

Having conditional delay slots for conditional branches takes some of
the sting out of this.  Branch slots are harder to fill when they are
unconditional:  the delay instruction must be ``useful'' (or harmless)
for both the branch-taken and branch-not-taken cases.  The delay action
can be made ``conditional'' by adding silicon to nullify the action
(e.g. prevent writeback of the results to the register cells or abort a
load/store instruction) in the case of the branch not taken (or vice
versa).  In this situation the branch slot can always be filled for
static branch targets: simply duplicate the branch target instruction
in the delay slot and move the target forward by one instruction.
Alternatively, the delay instruction could be aborted when the branch
IS taken and used if the branch falls through.  Which choice to
implement depends on the expected probability of branch taken versus
not taken.  [Sorry, I have no data on this.  John??]  Of course in
these case the delay instruction is only useful when the branch is
taken (or not taken).

It is better to fill an unconditional branch slot than a conditional
one assuming that the delay action is useful in both cases, however it
is also useful to have the option for conditional delay slots to
statisfy (half-of) the remaining cases.  Whether or not the extra
silicon is justified depends on (a) how often conditional branches are
expected, (b) how often we expect unconditional delay slots to remain
unfilled, and (c) how hard it is to ``nullify'' the delay instructions
without increasing the cycle time (among other things).  [N.B. One
might need some circuitry to nullify partial instructions anyways so
that we can abort execution and restart again in case of a page fault.]

In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>  ...The second is that one often branches on
>a value immediately after calculating it (i++; if i > 5 ...).

As to using a value immediately after calculating it: I don't know
about other processors, but on the MIPS/R2000 there is little problem
in this.  The result of any register/immediate integer instruction
(other than multiply or divide) is always available for use as an
operand in the next instruction without any delay.  Having just altered
the value is actually advantageous: it means that the value will be
stored-in or cached-in a register.  Since many branches tend to be
preceeded by increment or decrement instructions rather than multiplies
or divides, there is little problem.  If the value being tested is
coming from memory, you are stuck with the load delay (but may be able
to pre-fetch it some of the time.)

Perhaps someone who has worked on designing RISC chips would care
comment on the cost of adding interlocked loads and conditional delay
slots.

--
Jeff Lee <jonah@db.toronto.edu> or <...!utzoo!utai!jonah>