Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!decwrl!pyramid!voder!apple!bcase
From: bcase@apple.UUCP (Brian Case)
Newsgroups: comp.arch
Subject: Re: register windows
Message-ID: <6571@apple.UUCP>
Date: Thu, 29-Oct-87 13:13:34 EST
Article-I.D.: apple.6571
Posted: Thu Oct 29 13:13:34 1987
Date-Received: Wed, 4-Nov-87 01:46:13 EST
References: <201@PT.CS.CMU.EDU> <28200058@ccvaxa>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc., Cupertino, USA
Lines: 47

In article <28200058@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>..> Tim Olson posts comparisons of load/stores, and load/stores
>..> with non-zero offsets, for MIPS (lots of indexing) and AMD29K (little).
>
>How do you count load/stores with non-zero offsets for the AMD29000,
>seeing as it has no such addressing mode? Was this for a design
>alternative, that did have indexing? Or do you count by looking at the
>number of loads that have an add immediately preceding to the register
>used in the address of the load/store? Variations of this last method
>would seem to me to be very susceptible to masking by code rearrangement
>in the optimizer.
>
>A while back I posted static counts of address constants on VAX 4.2 BSD.
>As I remember it, about 40% were for locals (stack offsets), but 20% were
>for pointer+field_offset. How do you optimize those away?

Well, I can tell you that, at least for the original 29000, an offset
addressing mode was NEVER a design alternative. I would have had a
tantrum.

Anyway, when Tim told me that he was collecting these statistics, one of
my first thoughts was: "But wait: optimization will move some of the add
instructions away from the object load or store instruction. That will
skew the results!" Upon further thought, I realized that if optimization
can move the add instruction away (out of a loop or to fill a branch
delay or load/store delay slot), then it *shouldn't* be counted anyway.
The more of these that happen, the more sense it makes not to have the
offset addressing mode. In other words, let's count only those offset
additions that would be *profitably* integrated with the load or store
instruction itself.

Even though the numbers Tim collected are low, I believe that a real
optimizing compiler (coming soon, I hear) will produce even lower (if
only marginally lower) numbers. My compiler does not do load/store
latency scheduling; if it did, more opportunities to overlap the offset
add would arise. In retrospect, large performance increases could have
been realized, especially for the 29000 with VDRAM memory configurations
(this has been proven by hand coding; sigh, one has only so much
time...).

I'll reiterate: you just can't say things like: "processors need to have
x, no matter what the rest of the architecture is like." The offset
addressing mode has its own costs. From the statistics collected from
the output of my (not-wonderfully-optimizing) compiler, one is led to
believe that an offset addressing mode would have little positive impact
on the 29000. Further, it would only add another delay in the memory
addressing path.

I was very glad to see the numbers you collected from studying the VAX.
Were those dynamic measurements?
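
The counting rule discussed above (count an offset add only when it
still sits directly in front of the load or store that uses its result)
can be sketched in a few lines of Python. This is only an illustration
under stated assumptions: the Insn record, the opcode names "add",
"load" and "store", and the register names are all hypothetical, made up
for the example. It is not the 29000 instruction set and not the tool
Tim Olson actually used.

```python
from typing import NamedTuple, Optional


class Insn(NamedTuple):
    """Hypothetical toy instruction; not a real 29000 encoding."""
    op: str                    # "add", "load", "store", or anything else
    dest: Optional[str]        # register written (None for a store)
    src: Optional[str]         # source register; the address for load/store
    imm: Optional[int] = None  # immediate operand of an add, if any


def count_offset_candidates(code):
    """Return (memory_ops, candidates).

    A load/store is a candidate when the instruction immediately before
    it is an add of a non-zero immediate whose result is the address
    register. Adds that a scheduler has moved elsewhere are deliberately
    not counted.
    """
    memory_ops = 0
    candidates = 0
    prev = None
    for insn in code:
        if insn.op in ("load", "store"):
            memory_ops += 1
            if (prev is not None
                    and prev.op == "add"
                    and prev.imm not in (None, 0)
                    and prev.dest == insn.src):
                candidates += 1
        prev = insn
    return memory_ops, candidates


# Assumed example sequence, for illustration only.
example = [
    Insn("add", "r2", "r1", 8),
    Insn("load", "r3", "r2"),    # add directly precedes: counted
    Insn("add", "r4", "r1", 16),
    Insn("mul", "r5", "r3"),     # unrelated work sits between add and load
    Insn("load", "r6", "r4"),    # add was moved away: not counted
    Insn("store", None, "r1"),   # plain register address: not counted
]

memory_ops, candidates = count_offset_candidates(example)
print(memory_ops, candidates)
```

A static count like this is sensitive to instruction ordering by design:
the better the scheduler hides the add, the fewer candidates remain,
which is exactly the effect described in the post.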