Path: utzoo!telly!attcan!uunet!tut.cis.ohio-state.edu!BAYES.ARC.NASA.GOV!self
From: self@BAYES.ARC.NASA.GOV (Matthew Self)
Newsgroups: gnu.gcc.bug
Subject: Strength reduction vs Motorola 88k [was more patches to loop.c]
Message-ID: <8904060012.AA10478@bayes.arc.nasa.gov>
Date: 6 Apr 89 00:12:15 GMT
References: <8904052322.AA00710@tiktok.dg.com>
Sender: daemon@tut.cis.ohio-state.edu
Distribution: gnu
Organization: GNUs Not Usenet
Lines: 110


   Now that strength reduction doesn't cause an error, can somebody point
   me to some code that actually speeds up with strength reduction?  For
   the 88k compiler, I have not found anything that speeds up, and
   dhrystone in fact slows down by 1%.  This is more curiousity at this
   stage...

On the 68k most any code which uses array indexing will speed up.
Things like dhrystone don't always speed up becuase strength reduction
has often been done in source code....  (yuck!)

   I hypothesize that strength reduction is being too general in the way
   it optimizes things.  As I understand strength reduction in general,
   though not yet the specifics of what GNU does, it trys to replace high
   cost instructions such as multiplies with lower cost instructions such
   as adds.  Thus for a loop of the form:

	   for (i = 0; i < n; i++)
		   a[i] = 3*b[i];

   with a loop of the form:

	   for ((p1 = &a, p2 = &b, i = 0);
		i < n;
		(p1 += sizeof(a[0]), p2 += sizeof(b[0]), i++))
		   *p1 = 3*(*p2);

   which eliminates the multiplies inheirent in the array indexing.

Exactly right.  The new patches I added make it go even further and
eliminate the variable i entirely.  Thus you get:

   for ((p1 = &a, p2 = &b);
	p1 < &a + n * sizeof(a[0]);
	(p1 += sizeof(a[0]), p2 += sizeof(b[0])))
	   *p1 = 3*(*p2);

which saves a register and the cost of incrementing i.

   However, not all multiplies are high cost.  In particular, some
   multiplies can be folded directly into memory indexing and such, and
   others that are powers of two can be done simply by the appropriate
   shifts.

Strength reduction should be worthwhile even when multiplies (or
shifts) cost the same as adds.  In order to implement the original
loop on the MIPS CPU, you would have to have one register for i, a, b
and n, and also one each for a + sizeof(a[0]) and b + sizeof(b[0]).
You would also need two more to do the multiplication in.  The fully
strength reduced version eliminates the need for the registers for i,
a, b, and n.  The arithmetic may be no faster, but many registers are
saved.

I don't know about the 88k.  It is possible that indexed addressing
uses an adder/shifter which runs in parallel to the main ALU, in which
case you might be better of doing more arithetic in the address ALU
rather than doing less arithemtic, but in the main CPU.  (?)
Certainly the register savings will not be so great.

   Could this be the reason on the 88k that I see slowdown, is that in
   attempting to optimize things, it replaces a 1 cycle add + indexed
   multiply into two or three adds spread over multiple registers, which
   is then adds up to multiple cycles.  Or could the 88k multiply
   instruction be sufficiently fast (4 clocks compared to 1 for add),
   that strength reduction just doesn't produce the gains that you see on
   more conventional processors.  At present, I haven't had time to go
   further on this issue.

I am not sure about this.  Is indexing with multiply really that fast?
By far the hardest part of doing optimizations is deciding WHEN to do
them, not WHETHER you can do them!  GCC doesn't provide a clean way to
provide this kind of maxhine dependent info.  RMS is trying hard to
avoid creating yet another ad-hoc macro to specify this kind of thing.

  On the 68k, your optimization looks like this before strength
reduction (I changed 3 to 300 to prevent gcc from trying to use shift
and adds):

L5:
	movel a0@(d0:l:4),d2		; a[i] -> d2
	mulsl #300,d2
	movel d2,a1@(d0:l:4)		; d2 -> b[i]
	addql #1,d0			; i++
	cmpl d0,d1			; i < n
	jgt L5

and like this after:

L5:
	movel a1@+,d1			; *p1++ -> d1
	mulsl #300,d1
	movel d1,a0@+			; d1 -> *p2++
	cmpl a0,d0			; p2 < b + n * sizeof(b[0])
	jgt L5

On the 68k, autoincrement is much faster than indexing, so this is a
big win.  Getting rid of i also helps, and we have also saved a
register.  The strength reduced code really is optimal -- I couldn't
do any better by hand.

Send me the assembler for the 88k for this program, and I'll take a
look.  Use -O -S and -O -S -fstrength-reduce:

foo (int *a, int *b, int c, int n)
{
  int i;

  for (i = 0; i < n; i++)
    a[i] = 300*b[i];
}