Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!decwrl!pyramid!voder!apple!bcase
From: bcase@apple.UUCP (Brian Case)
Newsgroups: comp.arch
Subject: Re: register windows
Message-ID: <6571@apple.UUCP>
Date: Thu, 29-Oct-87 13:13:34 EST
Article-I.D.: apple.6571
Posted: Thu Oct 29 13:13:34 1987
Date-Received: Wed, 4-Nov-87 01:46:13 EST
References: <201@PT.CS.CMU.EDU> <28200058@ccvaxa>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc., Cupertino, USA
Lines: 47

In article <28200058@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>..> Tim Olson posts comparisons of load/stores, and load/stores
>..> with non-zero offsets, for MIPS (lots of indexing) and AMD29K (little).
>
>How do you count load/stores with non-zero offsets for the AMD29000,
>seeing as it has no such addressing mode? Was this for a design
>alternative, that did have indexing? Or do you count by looking at the
>number of loads that have an add immediately preceding to the register
>used in the address of the load/store? Variations of this last method
>would seem to me to be very susceptible to masking by code rearrangement
>in the optimizer.
>
>A while back I posted static counts of address constants on VAX 4.2 BSD.
>As I remember it, about 40% were for locals (stack offsets), but 20% were
>for pointer+field_offset. How do you optimize those away?

Well, I can tell you that, at least for the original 29000, an offset
addressing mode was NEVER a design alternative.  I would have had a
tantrum.  Anyway, when Tim told me that he was collecting these statistics,
one of my first thoughts was:  "But wait:  optimization will move some
of the add instructions away from the object load or store instruction.
That will skew the results!"  Upon further thought, I realized that if
optimization can move the add instruction away (out of a loop or to fill
a branch delay or load/store delay slot), then it *shouldn't* be counted
anyway.  The more of these that happen, the more sense it makes not to
have the offset addressing mode.  In other words, let's count only those
offset additions that would be *profitably* integrated with the load or
store instruction itself.  Even though the numbers Tim collected are low,
I believe that a real optimizing compiler (coming soon, I hear) will
produce even lower (if only marginally lower) numbers.  My compiler does
not do load/store latency scheduling; if it did, more opportunities to
overlap the offset add would arise.  In retrospect, such scheduling could
have yielded large performance increases, especially for the 29000 with
VDRAM memory configurations (this has been proven by hand coding; sigh, one
has only so much time...).
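
To make the counting rule concrete, here is a minimal sketch of the
adjacency-based count discussed above.  It is not the tool Tim used, and the
instruction format (the hypothetical Insn tuple and its field names) is a toy
invented for illustration, not 29000 assembly.  It counts only those loads
and stores whose address register was written by the immediately preceding
add of a non-zero constant, i.e. the cases an offset addressing mode could
absorb.

```python
from typing import List, NamedTuple


class Insn(NamedTuple):
    """Toy instruction record (hypothetical, not a real 29000 encoding)."""
    op: str       # "add", "load", "store", or anything else
    dest: str     # add/load: register written; store: register holding the data
    base: str     # add: source register; load/store: address register
    imm: int = 0  # add: constant added; unused for load/store


def count_fusable(code: List[Insn]) -> int:
    """Count loads/stores directly preceded by an add that builds their address."""
    count = 0
    for prev, cur in zip(code, code[1:]):
        if (cur.op in ("load", "store")
                and prev.op == "add"
                and prev.imm != 0
                and prev.dest == cur.base):
            count += 1
    return count


# Straightforward code: each add sits right before the access that uses it.
unscheduled = [
    Insn("add", "r1", "r2", 8),
    Insn("load", "r3", "r1"),
    Insn("add", "r4", "r2", 12),
    Insn("store", "r5", "r4"),
]

# Same work, but the add feeding the store has been moved earlier, as a
# scheduler might do to fill a delay slot.  It no longer counts.
scheduled = [
    Insn("add", "r4", "r2", 12),
    Insn("add", "r1", "r2", 8),
    Insn("load", "r3", "r1"),
    Insn("store", "r5", "r4"),
]

fusable_before = count_fusable(unscheduled)  # 2
fusable_after = count_fusable(scheduled)     # 1
```

The drop from 2 to 1 is the effect described above: an add that the
optimizer could usefully move elsewhere is one the addressing mode would not
have saved.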

I'll reiterate:  you just can't say things like:  "processors need to
have x, no matter what the rest of the architecture is like."  The offset
addressing mode has its own costs.  From the statistics collected from the
output of my (not-wonderfully-optimizing) compiler, one is led to believe
that an offset addressing mode would have little positive impact on the
29000.  Further, it would only add another delay in the memory addressing
path.
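
The trade-off can be put in back-of-the-envelope form.  The sketch below is
a deliberately crude model, not a measurement:  it assumes every instruction
takes one cycle, that each fusable add/load or add/store pair collapses to a
single instruction, and that the extra adder in the address path stretches
the cycle time by some fraction.  The function names and the sample numbers
are made up for illustration.

```python
def relative_time(fusable_fraction: float, cycle_stretch: float) -> float:
    """Run time with the offset mode, relative to 1.0 without it.

    fusable_fraction: fusable add+load/store pairs per instruction executed
    cycle_stretch:    fractional cycle-time increase from the addressing adder
    """
    return (1.0 - fusable_fraction) * (1.0 + cycle_stretch)


def break_even_fraction(cycle_stretch: float) -> float:
    """Fusable fraction at which the offset mode neither helps nor hurts."""
    return cycle_stretch / (1.0 + cycle_stretch)


# Illustrative numbers only: 2% fusable pairs against a 5% longer cycle.
example = relative_time(0.02, 0.05)      # greater than 1.0, i.e. a net loss
threshold = break_even_fraction(0.05)    # pairs needed just to break even
```

Under these assumptions the mode pays off only when the fusable fraction
exceeds cycle_stretch / (1 + cycle_stretch), which is why low counts argue
against it.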

I was very glad to see the numbers you collected from studying the VAX.
Were those dynamic measurements?