Path: utzoo!telly!attcan!uunet!tut.cis.ohio-state.edu!BAYES.ARC.NASA.GOV!self From: self@BAYES.ARC.NASA.GOV (Matthew Self) Newsgroups: gnu.gcc.bug Subject: Strength reduction vs Motorola 88k [was more patches to loop.c] Message-ID: <8904060012.AA10478@bayes.arc.nasa.gov> Date: 6 Apr 89 00:12:15 GMT References: <8904052322.AA00710@tiktok.dg.com> Sender: daemon@tut.cis.ohio-state.edu Distribution: gnu Organization: GNUs Not Usenet Lines: 110 Now that strength reduction doesn't cause an error, can somebody point me to some code that actually speeds up with strength reduction? For the 88k compiler, I have not found anything that speeds up, and dhrystone in fact slows down by 1%. This is more curiousity at this stage... On the 68k most any code which uses array indexing will speed up. Things like dhrystone don't always speed up becuase strength reduction has often been done in source code.... (yuck!) I hypothesize that strength reduction is being too general in the way it optimizes things. As I understand strength reduction in general, though not yet the specifics of what GNU does, it trys to replace high cost instructions such as multiplies with lower cost instructions such as adds. Thus for a loop of the form: for (i = 0; i < n; i++) a[i] = 3*b[i]; with a loop of the form: for ((p1 = &a, p2 = &b, i = 0); i < n; (p1 += sizeof(a[0]), p2 += sizeof(b[0]), i++)) *p1 = 3*(*p2); which eliminates the multiplies inheirent in the array indexing. Exactly right. The new patches I added make it go even further and eliminate the variable i entirely. Thus you get: for ((p1 = &a, p2 = &b); p1 < &a + n * sizeof(a[0]); (p1 += sizeof(a[0]), p2 += sizeof(b[0]))) *p1 = 3*(*p2); which saves a register and the cost of incrementing i. However, not all multiplies are high cost. In particular, some multiplies can be folded directly into memory indexing and such, and others that are powers of two can be done simply by the appropriate shifts. Strength reduction should be worthwhile even when multiplies (or shifts) cost the same as adds. In order to implement the original loop on the MIPS CPU, you would have to have one register for i, a, b and n, and also one each for a + sizeof(a[0]) and b + sizeof(b[0]). You would also need two more to do the multiplication in. The fully strength reduced version eliminates the need for the registers for i, a, b, and n. The arithmetic may be no faster, but many registers are saved. I don't know about the 88k. It is possible that indexed addressing uses an adder/shifter which runs in parallel to the main ALU, in which case you might be better of doing more arithetic in the address ALU rather than doing less arithemtic, but in the main CPU. (?) Certainly the register savings will not be so great. Could this be the reason on the 88k that I see slowdown, is that in attempting to optimize things, it replaces a 1 cycle add + indexed multiply into two or three adds spread over multiple registers, which is then adds up to multiple cycles. Or could the 88k multiply instruction be sufficiently fast (4 clocks compared to 1 for add), that strength reduction just doesn't produce the gains that you see on more conventional processors. At present, I haven't had time to go further on this issue. I am not sure about this. Is indexing with multiply really that fast? By far the hardest part of doing optimizations is deciding WHEN to do them, not WHETHER you can do them! GCC doesn't provide a clean way to provide this kind of maxhine dependent info. RMS is trying hard to avoid creating yet another ad-hoc macro to specify this kind of thing. On the 68k, your optimization looks like this before strength reduction (I changed 3 to 300 to prevent gcc from trying to use shift and adds): L5: movel a0@(d0:l:4),d2 ; a[i] -> d2 mulsl #300,d2 movel d2,a1@(d0:l:4) ; d2 -> b[i] addql #1,d0 ; i++ cmpl d0,d1 ; i < n jgt L5 and like this after: L5: movel a1@+,d1 ; *p1++ -> d1 mulsl #300,d1 movel d1,a0@+ ; d1 -> *p2++ cmpl a0,d0 ; p2 < b + n * sizeof(b[0]) jgt L5 On the 68k, autoincrement is much faster than indexing, so this is a big win. Getting rid of i also helps, and we have also saved a register. The strength reduced code really is optimal -- I couldn't do any better by hand. Send me the assembler for the 88k for this program, and I'll take a look. Use -O -S and -O -S -fstrength-reduce: foo (int *a, int *b, int c, int n) { int i; for (i = 0; i < n; i++) a[i] = 300*b[i]; }