Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!agate!eos!ames!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: RISC vs unaligned data
Message-ID: <16407@winchester.mips.COM>
Date: 2 Apr 89 02:24:07 GMT
References: <355@bnr-fos.UUCP> <13@microsoft.UUCP> <16058@cup.portal.com> <370@bnr-fos.UUCP> <11222@tekecs.GWD.TEK.COM>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 64

In article <11222@tekecs.GWD.TEK.COM> andrew@frip.wv.tek.com (Andrew Klossner) writes:
>[]
>Many contributors to this discussion seem to hold the opinion that, if
>alignment isn't supported by hardware, it isn't supported at all. But
>one of the points of RISC is to move complexity from hardware to
>software. Why not just let the compiler do it?

This is a reasonable generic argument. As usual, whether it's a good
idea or not depends on the numbers.

>For example, on the 88k, an architecture that doesn't have particularly
>good support for unaligned data, the compiler might generate code like
>this to fetch a word from an address that it knows will be odd:
	....
>If the word is in the data cache, this takes seven cycles and wastes
>two scratch registers (r2 and r3). (The code to fetch from an even but
>unaligned address takes five cycles.) With hardware support it could
>do a better job ... but is it necessary to fetch an unaligned word in
>fewer than seven cycles? That fetch takes fewer nanoseconds than it
>does on the modern, unalignment-forgiving CISC machine that I'm typing
>this on, which after all is the bottom line in RISC vs CISC.

It would be useful to cite:
	1) the cycle counts for stores,
	2) the cycle counts for loads/stores where the compiler has no
	idea (FORTRAN call-by-reference, for example).

I don't have much data on this, but I've heard that we've seen FORTRAN
programs take a 10-15% hit when compiled with the "unaligned-forgiving"
attribute (I don't know which one, there are several).

Let's try some back of the envelope numbers. Let's suppose it costs us
1-2 extra cycles per load or store. This means that if we're seeing
10-15%, somewhere between 5% and 15% (i.e., 10/2, 15/1 to get the
maximum range) of the instructions are incurring this penalty, which is
about 16-50% of the load/store instructions that are typical for such
programs.

Then, if the average penalty is N cycles, one is looking at a
first-order estimate of additional run time (beyond the optimal base of
1.0) of somewhere between .05*N and .15*N. Suppose N == 7, which gives
a range of .35-1.05 extra time, i.e., 1.35-2.05X the base run time.
The code size would also expand, although probably not as much.

Of course, if N == 100 (if you were doing it with exceptions, perhaps),
you now get + 5-15, i.e., 6-16X slower, which is clearly ungood, and
only survivable for debugging.

What this says is:
	1) As Andrew says, you can maybe survive by generating the extra
	code to do this.
	2) Depending on your code sequences, and the frequency of this
	problem, you might get away with a 10-15% hit (as on MIPS, with
	unaligned instructions). For more typical cases, looking at
	these numbers, I'd guess at a 50% hit if I had to pick a single
	number. A 50% hit is either a) irrelevant, or b) Very Important,
	depending on what you're doing. 50% hits in big CAD crunchers
	are sometimes considered Bad....

Since this is done from rather minimal input, maybe somebody with real
data might choose to post it?
-- 
-john mashey	DISCLAIMER:
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
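
For anyone who wants to replay the back of the envelope estimate above
with other inputs, here is a minimal sketch in Python. The function and
variable names are made up for illustration; the only inputs are the
figures used in the post (a 10-15% observed hit, 1-2 extra cycles per
penalized load or store, and N == 7 or N == 100), and the model is the
same first-order one, nothing more careful.

```python
# First-order model of the unaligned-access penalty.
# All names are hypothetical; the numbers come from the post itself.

def penalized_fraction(hit, extra_cycles):
    """Fraction of instructions paying the penalty, inferred from an
    observed slowdown `hit` and `extra_cycles` lost per such instruction."""
    return hit / extra_cycles

def extra_run_time(fraction, n_cycles):
    """Additional run time beyond the base of 1.0 when `fraction` of the
    instructions each cost `n_cycles` more cycles."""
    return fraction * n_cycles

# Widest range: smallest hit over largest cost, largest hit over smallest cost.
low = penalized_fraction(0.10, 2)
high = penalized_fraction(0.15, 1)

for n in (7, 100):
    lo_extra = extra_run_time(low, n)
    hi_extra = extra_run_time(high, n)
    print(f"N={n}: +{lo_extra:.2f} to +{hi_extra:.2f} extra, "
          f"{1 + lo_extra:.2f}X to {1 + hi_extra:.2f}X total")
```

Swapping in measured values for the hit and the per-access cost is the
whole point; the 5-15% range is only as good as the hearsay behind it.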