Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!uw-beaver!rice!titan!preston
From: preston@titan.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: delayed branch (& delayed loads!)
Message-ID: <408@brazos.Rice.edu>
Date: 2 Aug 89 20:06:29 GMT
References: <2246@taux01.UUCP> <1462@l.cc.purdue.edu> <26139@shemp.CS.UCLA.EDU> <33669@apple.Apple.COM>
Sender: root@rice.edu
Reply-To: preston@titan.rice.edu (Preston Briggs)
Organization: Rice University, Houston
Lines: 37

In article <33669@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>An interesting point to ponder is that delayed loads also have to schedule
>something into their shadows, and they are (very roughly) as frequent as
>branches. Now, there are only so many instructions that can be re-arranged,
>so it is possible that although branches shadows can be filled 70% of the time,
>and load shadows can be filled 70% of the time, it may not be true that you can
>fill both of the 70% of the time. Any comments from someone who has actually
>measured this? It might be interesting to turn off filling of one, and see how
>the percentage of filling the other increases.

Martin Hopkins, writing about the PL.8 compiler and the 801,
points out that it's especially nice if you can schedule a load
in the branch "shadow" (great term, by the way). That way the load
completes while the branch is branching. The PL.8 compiler
does some non-trivial rearrangement (really rewriting) to achieve this.

I schedule for both cases, but I don't have any good numbers
I can give you now (haven't measured the results recently;
wish I had a pixie). My impression is that I do better
on branches than on loads.

On the RT, the "flavor" of the two problems is a little different.
Loads cast a long shadow, and I try to avoid scheduling instructions
that use the loaded register during the shadow. For branches, I try
to find an instruction that doesn't interfere with the CC and can
execute at the end of the block.

They aren't necessarily competing goals. An easy example is
a mem-mem copy, that is a LOAD followed by a STORE.
The STORE can be scheduled in the branch slot, and the LOAD as early
as possible (up to 6 cycles anyway).

On the other hand, your point seems valid. I'm just not sure
of an easy way to measure the effect.

Regards,
Preston Briggs
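
The mem-mem copy case can be sketched with a toy cycle counter. This is only an illustration under stated assumptions, not the RT's real timing and not the scheduler described above: the Instr record, the count_cycles function, the register names, and the 2-cycle LOAD_SHADOW are all made up for the example. The model issues one instruction per cycle and stalls only when a source register is still inside a load shadow; the instruction after the branch stands for the branch slot.

```python
from dataclasses import dataclass

LOAD_SHADOW = 2  # assumed value: cycles after a load before its result is usable


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: opcode, destination, source registers."""
    op: str
    dest: str = ""
    srcs: tuple = ()


def count_cycles(block, load_shadow=LOAD_SHADOW):
    """Issue instructions in order, one per cycle, stalling whenever a
    source register is still inside the shadow of an earlier load."""
    ready = {}  # register -> first cycle at which its value may be used
    cycle = 0
    for ins in block:
        # Wait until every source register is out of its load shadow.
        start = max([cycle] + [ready.get(r, 0) for r in ins.srcs])
        if ins.op == "load":
            ready[ins.dest] = start + 1 + load_shadow
        cycle = start + 1
    return cycle


# Naive order: the STORE sits in the load shadow, and the branch slot
# (the instruction after the branch) holds a no-op.
naive = [
    Instr("load", "r1", ("r2",)),
    Instr("store", "", ("r1", "r3")),
    Instr("alu", "r5", ("r5",)),
    Instr("cmp", "", ("r5", "r6")),
    Instr("branch"),
    Instr("nop"),
]

# Scheduled order: the LOAD goes first, independent work covers its
# shadow, and the STORE (which does not touch the CC) fills the branch slot.
scheduled = [
    Instr("load", "r1", ("r2",)),
    Instr("alu", "r5", ("r5",)),
    Instr("cmp", "", ("r5", "r6")),
    Instr("branch"),
    Instr("store", "", ("r1", "r3")),
]

print(count_cycles(naive), count_cycles(scheduled))
```

In this toy model the naive order takes 8 cycles and the scheduled order takes 5: moving one STORE fills both the load shadow and the branch slot. Counting stalls this way with one kind of filling disabled would be a crude version of the experiment suggested in the quoted article.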