Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!mcnc!duke!cameron!dukee!rfk From: rfk@dukee.egr.duke.edu (Robert F Krick) Newsgroups: comp.arch Subject: Re: Filling branch delay slot with test Message-ID: <423@cameron.cs.duke.edu> Date: 12 Sep 89 18:01:48 GMT References: <9881@alice.UUCP> Sender: news@cameron.cs.duke.edu Lines: 105 Indeed a variety of techniques have been incorporated in CRISP to reduce but not eliminate the performance degradation caused by conditional branch instructions in an instruction pipeline. CRISP or any other architecture with a single prefetch and decode pipeline cannot *guarantee* that the sustained performance of the execution unit will equal its theoretical maximum performance for any program. From article <9881@alice.UUCP>, by jmk@alice.UUCP (Jim McKie): > > To paraphrase from 'Architectural Innovations in the CRISP > Microprocessor', Spring COMPCON 87 Proceedings, by > Berenbaum, Ditzel and McLellan: > > The CRISP CPU has 'branch folding': the Decoded Instruction Cache > contains a 'next address' field and also, in the case of conditional > branches, an 'alternate address' field. > Logic in the PDU recognises when a non-branching instruction is followed > by a branching instruction and 'folds' the two instructions together. > This single instruction is then placed in the Decoded Instruction Cache. > The separate branch instruction disappears entirely from the execution > pipeline and the program behaves as if the branch were executed in > zero time. Clearly branch folding is an excellent hardware solution to alleviate the execution unit from the trivial operation of determining the new program counter. However, this scheme breaks down for consecutive conditional branch instructions! > The 'alternate address' field is used to hold the second possible > next instruction addresss for folded conditional branch instructions. > When an instruction folded with a conditional branch is read from > the instruction cache, one of the two paths for the branch is > selected for the next instruction address, and the address that was > not used is retained with the instruction as it proceeds down > the execution pipeline. The alternate address field is retained with > each pipeline stage only until the logic can determine whether the > selected branch path was correct or not. When the outcome of the > branch condition is known, if the wrong next address was selected, > any instructions in progress in the pipeline following the conditional > branch are flushed and the alternate address from the folded > conditional branch is re-introduced as the next instruction address > at the beginning of the execution pipeline. The flush of the instruction pipeline in this scheme results from only retaining the alternate *address* field and not the alternate *instruction* through the instruction pipeline. Whenever the instruction pipeline must be flushed, the performance of the execution unit falls below its theoretical maximum. > Determining which of the two possible next addresses of a conditional > branch is likely to be taken is aided in CRISP with static branch > prediction. The encoding for the conditional branch instruction > contains a single branch prediction bit which can be set by the > compiler. By definition, branch prediction (either static or dynamic) cannot always guess the direction of the branch. Furthermore, the need for branch prediction highlights the fact that branch folding cannot guarantee that the performance of the execution unit will not be degraded. > Branch prediction is useful when a conditional branch instruction > in the pipeline can alter the flow of instructions before the > result of a comparison can be computed. If, however, there are no > compare instructions in the pipeline then there is no need for > branch prediction. Since only the compare instruction may set the > condition code flag, the outcome of the conditional branch is known > with certainty. > CRISP intentionally has separate compare and conditional branch > instructions so that a compiler or optimizer may insert instructions > between the compare and conditional branch (this is 'branch spreading'). > By combining branch folding and code motion in CRISP there can be > no cost at all for either conditional or uncondditional branches. Statistics have shown that changes in program flow occur as frequently as every five instructions (on the average!) which limits the benefit of 'branch spreading'. Furthermore, I believe that the technique of 'branch spreading' can be proved to be equivalent (at best) to filling the delay slots associated with delayed branches (generally acknowledged to be successful only 70% of the time for the first delay slot and much less for subsequent slots.) > > Jim McKie research!jmk -or- jmk@research.att.com Last but not least, any cache misses which occur when prefetching an instruction will idle the execution unit as the instruction is brought from the slower main memory. In contrast, we have developed an architecture and have constructed a functional prototype with multiple instruction prefetch and decode units to guarantee that the next instruction will *always* be prefetched and decoded for execution even in the presence of conditional branch instructions. Furthermore, i-cache misses are completely eliminated by making use of a graph of the program flow. As a result, the execution unit must never wait for a decoded instruction in executing any program. ========================================================================= Robert F. Krick PHONE: (919) 684-3123 x61 Dept. of Electrical Eng. CSNET: rfk@dukee Duke University UUCP: decvax!duke!dukee!rfk Durham, NC 27706 ARPA: rfk@dukee.egr.duke.edu "When the branching gets conditional, the prefetching gets tough, and the tough get going!" ========================================================================= Robert F. Krick PHONE: (919) 684-3123 x61 Dept. of Electrical Eng. CSNET: rfk@dukee Duke University UUCP: decvax!duke!dukee!rfk