Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!cs.utexas.edu!uunet!bywater!arnor!prener!prener
From: prener@arnor.UUCP (Dan Prener)
Newsgroups: comp.arch
Subject: Re: Second-generation RISC
Message-ID: <1991Mar28.003706.454@arnor.uucp>
Date: 28 Mar 91 00:37:06 GMT
References: <6128@baird.cs.strath.ac.uk> <7425@titcce.cc.titech.ac.jp> <3189@inews.intel.com> <705@seqp4.UUCP> <1991Mar26.210643.2052@dg-rtp.dg.com>
Sender: news@arnor.uucp (NNTP News Poster)
Reply-To: prener@prener.watson.ibm.com (Dan Prener)
Organization: IBM T.J. Watson Research Center
Lines: 36

In article <1991Mar26.210643.2052@dg-rtp.dg.com>, hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
|>
|> Suppose, for example, that we discover that 99% of the time the compiler
|> can profitably fill three branch delay slots.  Then, regardless of the
|> pipeline structure of the machine, the delayed branches should support three
|> delay slots.  Instead of writing:
|>
|>         instr a
|>         instr b         (executes in four clocks on a non-superscalar
|>         branch.n foo     machine with one fetch/decode stage)
|>         instr c
|>
|> we write:
|>
|>         branch.n foo
|>         instr a         (also executes in four clocks on a non-superscalar
|>         instr b          machine with one fetch/decode stage)
|>         instr c
|>
|> Since, by hypothesis, the compiler can profitably use all three slots nearly all
|> the time we can live with the occasional no-op inserted when all three slots
|> cannot be filled.
|>

That argument ignores the second-order effects, which arise from the
memory hierarchy.  In a machine that doesn't really need three branch
delay slots, there will be no gain from having delayed branches with
three slots.  But there can be dramatic losses.  Think of the
(admittedly uncommon) cases in which the fetch of a no-op that padded
out the third delay slot causes a cache miss, or, even worse, a page
fault.
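
To put rough numbers on that kind of loss, here is a minimal sketch of the
arithmetic.  Every figure in it is an invented assumption for illustration,
not a measurement from any real machine, and the function name and
parameters are hypothetical.  It treats the benefit of the extra slots as
zero on a machine that does not need them and charges only for the padding
no-ops whose fetch misses the cache or faults.

```python
def expected_gain_per_branch(p_pad, p_miss, miss_penalty,
                             p_fault=0.0, fault_penalty=0.0,
                             useful_gain=0.0):
    """Expected cycles gained per delayed branch (negative means lost).

    p_pad         -- probability a delay slot must be padded with a no-op
    p_miss        -- probability that fetching that no-op misses the cache
    miss_penalty  -- cycles lost on such a miss
    p_fault       -- probability that fetching that no-op page-faults
    fault_penalty -- cycles lost on such a fault
    useful_gain   -- cycles the extra slots save; zero on a machine whose
                     pipeline does not need them
    All values are assumptions supplied by the caller.
    """
    expected_loss = p_pad * (p_miss * miss_penalty + p_fault * fault_penalty)
    return useful_gain - expected_loss


if __name__ == "__main__":
    # Assumed: 1% of branches need a pad (the quoted 99% figure), 2% of
    # those fetches miss at 20 cycles, and 1 in 10 million page-faults at
    # a million cycles.  None of these numbers come from the thread.
    print(expected_gain_per_branch(0.01, 0.02, 20.0, 1e-7, 1e6))
```

With those assumed inputs the result is about -0.005 cycles per branch:
small, but strictly negative whenever any padding fetch can miss, since
nothing offsets it on a machine that gains nothing from the extra slots.
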

So the expected value of the three-slot delay on this machine is
negative: zero with large probability, and some significantly non-zero
negative number with small probability.

On the future superscalar implementation, there is some positive
contribution and some negative contribution to the expected value of
the delay.  So, even there, it is far from clear that it wins.
--
                                   Dan Prener (prener @ watson.ibm.com)