Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!usc!apple!amdcad!dgcad!dg-rtp!siberia!hamilton
From: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Newsgroups: comp.arch
Subject: Re: Second-generation RISC
Message-ID: <1991Mar26.210643.2052@dg-rtp.dg.com>
Date: 26 Mar 91 21:06:43 GMT
References: <6128@baird.cs.strath.ac.uk> <7425@titcce.cc.titech.ac.jp> <3189@inews.intel.com> <705@seqp4.UUCP>
Sender: usenet@dg-rtp.dg.com (Usenet Administration)
Reply-To: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Organization: Data General Corporation, Research Triangle Park, NC
Lines: 55

In article <705@seqp4.UUCP>, jdarcy@seqp4.ORG (Jeffrey d'Arcy) writes:
|> kds@blabla.intel.com (Ken Shoemaker) writes:
|> >Delayed
|> >branches will probably go away as they really are an artifact of having a
|> >short, fixed length pipeline.
|>
|> As long as there's at least one pipe stage devoted to instruction fetch and
|> decode, I think delayed branches would make sense.  If this part of the
|> pipeline ceases to be fixed length, maybe we'll see multiple-instruction
|> branch delays, with a *variable* number of instructions after the branch.
|>

This is a slightly backwards way of looking at the problem.  The question
is not how many branch delays are implied by the pipeline structure; that
will change from implementation to implementation.  The question is how
many branch delays can be profitably filled by compilers.

Suppose, for example, that we discover that 99% of the time the compiler
can profitably fill three branch delay slots.  Then, regardless of the
pipeline structure of the machine, the delayed branches should support
three delay slots.

Instead of writing:

        instr a
        instr b         (executes in four clocks on a non-superscalar
        branch.n foo     machine with one fetch/decode stage)
        instr c

we write:

        branch.n foo
        instr a         (also executes in four clocks on a non-superscalar
        instr b          machine with one fetch/decode stage)
        instr c

Since, by hypothesis, the compiler can profitably use all three slots
nearly all the time, we can live with the occasional no-op inserted when
all three slots cannot be filled.

But look at what happens when we move this code to a two
instructions/cycle superscalar...

The first sequence now wastes one entire cycle, or two instruction times,
and has an effective execution time of three clocks.  The second sequence
wastes nothing and has an effective execution time of two clocks.  The
second sequence, in other words, executes on a given implementation at its
best possible speed, regardless of the pipeline structure.

The moral of the story is that the delayed branching should be designed
around the best that the compiler can do, not the idiosyncrasies of a
particular implementation.  The compiler should be able to generate code
that uses all the available pipelining, without worrying about precisely
how much pipelining that is.

----------------------------------------------------------------------
Eric Hamilton                           +1 919 248 6172
Data General Corporation                hamilton@dg-rtp.rtp.dg.com
62 Alexander Drive                      ...!mcnc!rti!xyzzy!hamilton
Research Triangle Park, NC  27709, USA
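
As a rough check on the clock counts quoted above, here is a toy timing
model in Python.  It is only a sketch under assumptions that are not part
of any real machine description: instructions issue in order, a fixed
number per clock, with no data dependencies, and the branch target becomes
issuable a fixed number of fetch/decode clocks after the clock in which
the branch itself issues.  The names clocks_until_target, one_slot and
three_slots are made up for the illustration.

```python
from math import ceil

def clocks_until_target(seq, issue_width, fetch_stages=1):
    """Clocks that elapse before the branch target can start issuing.

    Toy assumptions (hypothetical, for illustration only): in-order issue
    of `issue_width` instructions per clock, no data dependencies, and the
    target is ready `fetch_stages` clocks after the branch's issue clock.
    """
    branch_clock = seq.index("branch") // issue_width   # clock in which the branch issues
    issue_clocks = ceil(len(seq) / issue_width)         # clocks needed to issue the sequence
    target_ready = branch_clock + 1 + fetch_stages      # clocks until the target is fetched
    return max(issue_clocks, target_ready)

one_slot = ["a", "b", "branch", "c"]      # branch late, one filled delay slot
three_slots = ["branch", "a", "b", "c"]   # branch first, three filled delay slots

for width in (1, 2):
    print(width,
          clocks_until_target(one_slot, width),
          clocks_until_target(three_slots, width))
```

Under these assumptions both sequences take four clocks at one
instruction per clock, while at two per clock the first takes three and
the second takes two, which agrees with the counts given in the text.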