Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!agate!eos!ames!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: RISC vs unaligned data
Message-ID: <16407@winchester.mips.COM>
Date: 2 Apr 89 02:24:07 GMT
References: <355@bnr-fos.UUCP> <13@microsoft.UUCP> <16058@cup.portal.com> <370@bnr-fos.UUCP> <11222@tekecs.GWD.TEK.COM>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 64

In article <11222@tekecs.GWD.TEK.COM> andrew@frip.wv.tek.com (Andrew Klossner) writes:
>[]

>Many contributors to this discussion seem to hold the opinion that, if
>alignment isn't supported by hardware, it isn't supported at all.  But
>one of the points of RISC is to move complexity from hardware to
>software.  Why not just let the compiler do it?

This is a reasonable generic argument.  As usual, whether it's a good
idea or not depends on the numbers.

>For example, on the 88k, an architecture that doesn't have particularly
>good support for unaligned data, the compiler might generate code like
>this to fetch a word from an address that it knows will be odd:
....
>If the word is in the data cache, this takes seven cycles and wastes
>two scratch registers (r2 and r3).  (The code to fetch from an even but
>unaligned address takes five cycles.)  With hardware support it could
>do a better job ... but is it necessary to fetch an unaligned word in
>fewer than seven cycles?  That fetch takes fewer nanoseconds than it
>does on the modern, unalignment-forgiving CISC machine that I'm typing
>this on, which after all is the bottom line in RISC vs CISC.

It would be useful to cite: 1) the cycle counts for stores,
2) the cycle counts for loads/stores where the compiler has no idea
(FORTRAN call-by-reference, for example).

I don't have much data on this, but I've heard of FORTRAN programs
here that take a 10-15% hit when compiled with the
"unaligned-forgiving" attribute (I don't know which one; there are
several).  Let's try some back-of-the-envelope numbers.
Let's suppose it costs us 1-2 extra cycles per load or store.
This means that if we're seeing a 10-15% hit, then somewhere between
5% and 15% (i.e., 10/2, 15/1 to get the maximum range) of the instructions
are incurring this penalty, which is about 16-50% of the load/store
instructions in such programs (taking loads and stores as roughly
30% of all instructions, which is what those figures imply).
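
For anyone who wants to check the envelope, here is a rough sketch of
that arithmetic.  The function name is made up for illustration, and the
30% load/store share is an assumption: it is simply the figure that
makes 5-15% of all instructions come out to roughly 16-50% of the loads
and stores.  A base of one cycle per instruction is also assumed.

```python
def penalized_fraction(slowdown, extra_cycles):
    """Fraction of all instructions that must each pay `extra_cycles`
    to account for `slowdown` extra run time (base: 1 cycle/instruction)."""
    return slowdown / extra_cycles

# Widest range: smallest hit at the largest per-access cost, and vice versa.
low = penalized_fraction(0.10, 2)    # 10% hit at 2 extra cycles each
high = penalized_fraction(0.15, 1)   # 15% hit at 1 extra cycle each

LOAD_STORE_SHARE = 0.30  # assumption: loads/stores as a share of all instructions

print(f"penalized instructions: {low:.0%} to {high:.0%}")
print(f"share of loads/stores:  {low / LOAD_STORE_SHARE:.0%} "
      f"to {high / LOAD_STORE_SHARE:.0%}")
```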
Then, if the average penalty is N cycles, a first-order estimate of the
additional run time (beyond the optimal base of 1.0) is somewhere
between .05*N and .15*N.  Suppose N == 7, which gives a range of .35-1.05
extra time.  The code size would also expand, although probably not
as much.

Of course, if N == 100 (if you were doing it with exceptions, perhaps),
you now get +5 to +15, i.e., 6-16X slower, which is clearly ungood, and
only survivable for debugging.
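
The two cases above (N == 7 and N == 100) can be run through the same
first-order model.  This is only a sketch: the function name is
hypothetical, and the 5% and 15% fractions are the rough bounds derived
earlier, not measured data.

```python
def extra_run_time(fraction, penalty_cycles):
    """First-order extra run time, relative to a base of 1.0, when
    `fraction` of instructions each cost `penalty_cycles` extra cycles."""
    return fraction * penalty_cycles

# N = 7: in-line code sequence; N = 100: fixing it up via exceptions.
for n in (7, 100):
    low = extra_run_time(0.05, n)
    high = extra_run_time(0.15, n)
    print(f"N={n}: +{low:.2f} to +{high:.2f} extra, "
          f"i.e. {1 + low:.2f}X to {1 + high:.2f}X the base run time")
```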

What this says is:
1) As Andrew says, you can maybe survive by generating the extra code
to do this.
2) Depending on your code sequences, and the frequency of this problem,
you might get away with a 10-15% hit (as on MIPS, with unaligned
instructions).  In more typical cases, looking at these numbers, I'd
guess at a 50% hit, if I had to pick a single number.
A 50% hit is either a) irrelevant, or b) Very Important, depending on
what you're doing.
50% hits in big CAD crunchers are sometimes considered Bad....

Since this is done from rather minimal input, maybe somebody with real
data might choose to post it?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086