Path: utzoo!utgpu!jarvis.csri.toronto.edu!db.toronto.edu!jonah Newsgroups: comp.arch From: jonah@db.toronto.edu (Jeffrey Lee) Subject: Re: delayed branch Message-ID: <1989Aug6.164210.11976@jarvis.csri.toronto.edu> References: <2246@taux01.UUCP> <1462@l.cc.purdue.edu> <26139@shemp.CS.UCLA.EDU> <7543@cbmvax.UUCP> Date: 6 Aug 89 20:42:10 GMT In article <7543@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes: > Note that a) branch frequency depends a lot on the architecture, and >also on the compiler's optimizer; b) slot-filling percentages (often taken >as a "universal" figure) depend highly on the instruction set of the machine, >and even the type of code being compiled. Take a look at the article by Dan Klein about CISC/RISC compiler construction in the proceedings of the Winter 89 USENIX (San Diego). He surveyed a number of CISC and RISC compilers and took polls of the usage frequency of the various instructions. He found a surprising degree of similarity of instruction frequency in code generated by C compilers on various CISC/RISC platforms. The following table is not found in the article, but is derived from the information in it: I/KVax move arith log cmp cmpbra bra call (excluding noop) 1000 402 80 12 133 157 90 126 vax/pcc 1033 420 81 14 130 158 98 130 vax/tartan 1158 472 96 15 166 165 103 136 vax/dec 1262 588 156 20 135 155 83 122 mc68020/gnu 1331 616 187 21 138 147 86 131 mc68020/sun 1470 801 127 66* 151 148 67 164 sparc/sun 1582 829 193 66* 72 170 117 131 88100/greenhills 1763 969 290 37 37 206 114 153 r2000/mips [*: ori is often used to load the lower portion of a register.] Most of the extra operations added in the RISC code seem to be loads and stores (lack of memory-memory operations) and arithmetic/logical operations (adds/shifts to make up for the reduced addressing modes). Somewhat surprisingly, although the RISC code has more instructions than the CISC code, it is less than a two-fold expansion in code size. Adjusting for average instruction size and adding in the data space, these RISC programs were only 25% to 50% larger than the VAX programs. (The 68020 programs were often 25% smaller than the VAX code.) In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes: >Studies have shown (I love that phrase) that one delay can be filled >70% of the time, and a second 30% of the time (approx, of course). This paper showed a measured branch and call (static) frequency ranging from 22.6% to 27.8% for R2000/SPARC/88100 and from 27.5% to 37.4% for MC68020/VAX. Loads made up 22% to 31% of the RISC code. The delay slot filling situation is somewhat confusing since some of these processors have interlocked loads while others do not. Similarly, some processors conditionally execute the branch delay slot instruction while others always execute it. [GOS==Grain-Of-Salt: I may be wrong in the following.] The MIPS/R2000 compiler used there [check the paper for the version number--I don't have a copy handy] filled load/branch/call [GOS] delay slots about 76% of the time leading to a 14% noop frequency. The Sun/SPARC compiler has interlocked loads [GOS] and filled its branch/call [GOS] delay slots about 75% of the time leading to a 4% noop frequency. The 88100 has interlocked loads and conditional delay slots [GOS] leading to 0% NOPs. Interlocked loads make it easier on the compiler writer: It means they don't have to worry about filling delay slots after load instructions. Unfortunately it also means that they might not try to prefetch data and end up with slower code as a result (with stalls instead of NOPs). However, if they do try to prefetch data (and if the interlocks do not lengthen the cycle time) they could be ahead: a one cycle stall should be slightly faster than a NOP because it never can cause a cache miss. Additionally, the lack of NOPs means that a few more ``useful'' instructions can be stored in the same sized cache. In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes: >There are two problems with filling delay slots. The first is that >branches occur on the order of every 3 to 5 instructions - that's >a lot of slots to fill. ... Having conditional delay slots for conditional branches takes some of the sting out of this. Branch slots are harder to fill when they are unconditional: the delay instruction must be ``useful'' (or harmless) for both the branch-taken and branch-not-taken cases. The delay action can be made ``conditional'' by adding silicon to nullify the action (e.g. prevent writeback of the results to the register cells or abort a load/store instruction) in the case of the branch not taken (or vice versa). In this situation the branch slot can always be filled for static branch targets: simply duplicate the branch target instruction in the delay slot and move the target forward by one instruction. Alternatively, the delay instruction could be aborted when the branch IS taken and used if the branch falls through. Which choice to implement depends on the expected probability of branch taken versus not taken. [Sorry, I have no data on this. John??] Of course in these case the delay instruction is only useful when the branch is taken (or not taken). It is better to fill an unconditional branch slot than a conditional one assuming that the delay action is useful in both cases, however it is also useful to have the option for conditional delay slots to statisfy (half-of) the remaining cases. Whether or not the extra silicon is justified depends on (a) how often conditional branches are expected, (b) how often we expect unconditional delay slots to remain unfilled, and (c) how hard it is to ``nullify'' the delay instructions without increasing the cycle time (among other things). [N.B. One might need some circuitry to nullify partial instructions anyways so that we can abort execution and restart again in case of a page fault.] In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes: > ...The second is that one often branches on >a value immediately after calculating it (i++; if i > 5 ...). As to using a value immediately after calculating it: I don't know about other processors, but on the MIPS/R2000 there is little problem in this. The result of any register/immediate integer instruction (other than multiply or divide) is always available for use as an operand in the next instruction without any delay. Having just altered the value is actually advantageous: it means that the value will be stored-in or cached-in a register. Since many branches tend to be preceeded by increment or decrement instructions rather than multiplies or divides, there is little problem. If the value being tested is coming from memory, you are stuck with the load delay (but may be able to pre-fetch it some of the time.) Perhaps someone who has worked on designing RISC chips would care comment on the cost of adding interlocked loads and conditional delay slots. -- Jeff Lee or <...!utzoo!utai!jonah>