Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!uw-beaver!rice!titan!preston
From: preston@titan.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: delayed branch (& delayed loads!)
Message-ID: <408@brazos.Rice.edu>
Date: 2 Aug 89 20:06:29 GMT
References: <2246@taux01.UUCP> <1462@l.cc.purdue.edu> <26139@shemp.CS.UCLA.EDU> <33669@apple.Apple.COM>
Sender: root@rice.edu
Reply-To: preston@titan.rice.edu (Preston Briggs)
Organization: Rice University, Houston
Lines: 37

In article <33669@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>An interesting point to ponder is that delayed loads also have to schedule
>something into their shadows, and they are (very roughly) as frequent as 
>branches. Now, there are only so many instructions that can be re-arranged,
>so it is possible that although branch shadows can be filled 70% of the time,
>and load shadows can be filled 70% of the time, it may not be true that you can
>fill both of them 70% of the time. Any comments from someone who has actually
>measured this? It might be interesting to turn off filling of one, and see how
>the percentage of filling the other increases.

Martin Hopkins, writing about the PL.8 compiler and the 801,
points out that it's especially nice if you can schedule a
load in the branch "shadow" (great term, by the way).
That way the load completes while the branch is branching.
The PL.8 compiler does some non-trivial rearrangement 
(really rewriting) to achieve this.

I schedule for both cases, but I don't have any good numbers I can
give you now (I haven't measured the results recently; wish I had a pixie).
My impression is that I do better on branches than on loads.

On the RT, the "flavor" of the two problems is a little
different.  Loads cast a long shadow, and I try to avoid
scheduling instructions that use the loaded register during the shadow.
For branches, I try to find an instruction that doesn't
interfere with the condition code (CC) and can execute at the end of
the block.  The two goals aren't necessarily competing.

An easy example is a memory-to-memory copy, that is, a LOAD followed
by a STORE.  The STORE can be scheduled in the branch slot, and the LOAD
as early as possible (up to 6 cycles ahead, anyway).
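
To make the copy case concrete, here is a toy sketch in Python. It is
not my scheduler or anything RT-specific: the instruction tuples, the
LOAD_SHADOW value, the register names, and the helpers can_sink,
fill_branch_slot and stall_cycles are all made up for illustration.
It assumes one instruction issues per cycle, only the branch reads the
CC, and the branch has a single slot.

```python
# Toy model. An instruction is (opcode, dest, sources, sets_cc).
# LOAD_SHADOW is an assumed latency, chosen only for illustration.
LOAD_SHADOW = 2
MEM_OPS = ("LOAD", "STORE")
NOP = ("NOP", None, (), False)


def can_sink(body, i):
    """True if body[i] may be moved below every later instruction in body."""
    op, dest, srcs, sets_cc = body[i]
    if sets_cc:
        return False  # must not disturb the CC the branch will test
    for later_op, later_dest, later_srcs, _ in body[i + 1:]:
        if later_dest is not None and later_dest in srcs:
            return False  # a later write would clobber one of our inputs
        if dest is not None and (dest in later_srcs or dest == later_dest):
            return False  # a later instruction needs or overwrites our result
        if op in MEM_OPS and later_op in MEM_OPS and "STORE" in (op, later_op):
            return False  # conservative: never reorder around a store
    return True


def fill_branch_slot(body, branch):
    """Return body + branch + slot, sinking the latest legal instruction."""
    for i in range(len(body) - 1, -1, -1):
        if can_sink(body, i):
            return body[:i] + body[i + 1:] + [branch, body[i]]
    return body + [branch, NOP]


def stall_cycles(code):
    """Count cycles lost waiting for loaded registers."""
    ready = {}  # register -> first cycle its loaded value can be used
    cycle = 0
    stalls = 0
    for op, dest, srcs, _ in code:
        need = max((ready.get(r, 0) for r in srcs), default=0)
        if need > cycle:
            stalls += need - cycle
            cycle = need
        if op == "LOAD":
            ready[dest] = cycle + 1 + LOAD_SHADOW
        cycle += 1
    return stalls


# Copy one word, bump a counter, test it, and branch.
body = [
    ("LOAD", "r1", ("r3",), False),
    ("STORE", None, ("r1", "r4"), False),
    ("ADD", "r2", ("r2",), False),
    ("CMP", None, ("r2",), True),
]
branch = ("BRANCH", None, (), False)

naive = body + [branch, NOP]
scheduled = fill_branch_slot(body, branch)
print([ins[0] for ins in scheduled])
print(stall_cycles(naive), stall_cycles(scheduled))
```

In this model the STORE sinks into the branch slot, which also moves
it out of the LOAD's shadow, so one rearrangement serves both goals.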

On the other hand, your point seems valid.
I'm just not sure of an easy way to measure the effect.

Regards,
Preston Briggs