Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!uw-beaver!rice!titan!preston
From: preston@titan.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: 80486 vs. 68040 code size [really: how many regs]
Message-ID: <3249@kalliope.rice.edu>
Date: 11 May 89 19:03:19 GMT
References: <927@aber-cs.UUCP>
Sender: usenet@rice.edu
Reply-To: preston@titan.rice.edu (Preston Briggs)
Distribution: eunet,world
Organization: Rice University, Houston
Lines: 156

Why I like registers

1) My code generators have three parts:
instruction selection, register allocation, and instruction scheduling.

These are all difficult problems; each is NP-Complete.
Additionally, they interfere with each other.
The best instruction depends on the register class of the operands,
pipeline scheduling increases register lifetimes, and so forth.

Lots of registers provides a simplifying separation of concerns.

With adequate registers: 
  I can choose instructions
    without worrying about running out of a particular class;
  I can schedule instructions before register allocation without
    the artificial anti-dependences introduced by register allocation
    and without worrying about excessively long register lifetimes; and
  I can use some sort of global (intra-routine) coloring allocator
    to avoid ad hoc local methods.
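
To make the anti-dependence point concrete, here is a small C sketch
in which local variables stand in for machine registers.  It is only
an illustration (the function names are made up, and a real C compiler
will do its own renaming); read it as pseudo-assembly.  The first
function assumes an allocator that ran early and reused r1 for two
unrelated values; the second assumes enough registers that scheduling
could run first.

```c
#include <assert.h>

/* Locals stand in for registers; names are hypothetical. */

/* Allocated first, with few registers: r1 is reused for both loads.
   The second write to r1 must wait for the earlier read of r1
   (a write-after-read, or anti-, dependence), so a scheduler that
   runs afterwards cannot overlap the two loads. */
int sum_pairs_reused(const int *a, const int *b)
{
    int r1, r2;
    assert(a && b);
    r1 = a[0];              /* load                               */
    r2 = r1 + a[1];         /* use of r1                          */
    r1 = b[0];              /* reuse of r1: anti-dependence above */
    r2 = r2 + r1 + b[1];
    return r2;
}

/* Scheduled first, with plenty of registers: each value gets its own
   register, so both loads can issue back to back.  Note that r3 now
   lives longer, which is the register-pressure cost of scheduling. */
int sum_pairs_renamed(const int *a, const int *b)
{
    int r1, r2, r3, r4;
    assert(a && b);
    r1 = a[0];
    r3 = b[0];              /* hoisted above the first add */
    r2 = r1 + a[1];
    r4 = r3 + b[1];
    return r2 + r4;
}
```

Both functions compute the same sum; only the freedom left to the
scheduler differs.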

2) The above arguments generally apply to integer registers.
I also like lots of FP registers.  These are harder to use;
generally FP values are hidden in arrays where optimizers
can't safely get at them.  By using (more expensive) dependence-based
optimization, a variety of transformations can be applied
to use (very profitably) almost any number of FP registers,
particularly with "typical, numeric Fortran."  From local
experiments, we want more than 16 FP registers; perhaps 32
is adequate.

For examples, see "Estimating and improving interlock for
pipelined architectures" by Callahan, Cocke, and Kennedy.
Proceedings of the 1987 International Conference on Parallel
Processing, August 87.  Or "Compiling C for Vectorization,
Parallelization, and Inline Expansion" by Allen and Johnson,
SIGPLAN 88.  Or (more to the point, but hard to get?) "Why even
scalar machines need vector compilers" by Allen and Lew,
TR, Ardent Computer, January 88.

3) An experiment.  An integer Fortran(!) program.  A non-recursive
version of quicksort.  An experimental, optimizing compiler
for the IBM RT (16 integer registers, 1 is stack pointer).
Compiler does partial redundancy elimination (global common
subexpressions and loop invariants), value numbering over
extended basic blocks, and dead code elimination
(no strength reduction or global constant propagation).
Uses a Chaitin-style, graph coloring, global register allocator.
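
For readers who haven't seen one, the simplify/select core of a
Chaitin-style allocator can be sketched in C as below.  This is not
the allocator used in the experiment: the names (Graph,
add_interference, color_graph, MAX_NODES) are made up for the sketch,
the spill choice is simply "highest remaining degree" rather than a
spill-cost estimate, and a real allocator would insert spill code and
rebuild the graph instead of just marking the node.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 32    /* hypothetical limit, for the sketch only */

/* Interference graph: adj[i][j] is true when live ranges i and j are
   live at the same time and so cannot share a register. */
typedef struct {
    int  n;
    bool adj[MAX_NODES][MAX_NODES];
} Graph;

void add_interference(Graph *g, int a, int b)
{
    assert(a != b && a >= 0 && b >= 0 && a < g->n && b < g->n);
    g->adj[a][b] = g->adj[b][a] = true;
}

/* Colors the graph with k registers.  color[i] receives a register
   number in 0..k-1, or -1 if live range i was chosen for spilling.
   Returns the number of spilled live ranges. */
int color_graph(const Graph *g, int k, int color[])
{
    bool removed[MAX_NODES] = { false };
    bool colored[MAX_NODES] = { false };
    int  stack[MAX_NODES];
    int  top = 0, spills = 0;

    assert(g->n <= MAX_NODES && k >= 1 && k <= MAX_NODES);

    /* Simplify: remove any node with fewer than k remaining
       neighbors and push it; it is certain to get a color later.
       If no such node exists, mark the highest-degree node as
       spilled and remove it. */
    for (int left = g->n; left > 0; left--) {
        int pick = -1, worst = -1, worst_deg = -1;
        for (int i = 0; i < g->n; i++) {
            int deg = 0;
            if (removed[i]) continue;
            for (int j = 0; j < g->n; j++)
                if (!removed[j] && g->adj[i][j]) deg++;
            if (deg < k && pick < 0) pick = i;
            if (deg > worst_deg) { worst_deg = deg; worst = i; }
        }
        if (pick >= 0) {
            stack[top++] = pick;
            removed[pick] = true;
        } else {
            color[worst] = -1;
            removed[worst] = true;
            spills++;
        }
    }

    /* Select: pop nodes in reverse order of removal and give each
       the lowest register not used by an already-colored neighbor. */
    while (top > 0) {
        int  v = stack[--top];
        int  c = 0;
        bool used[MAX_NODES] = { false };
        for (int j = 0; j < g->n; j++)
            if (colored[j] && g->adj[v][j]) used[color[j]] = true;
        while (used[c]) c++;
        assert(c < k);      /* guaranteed by the simplify phase */
        color[v] = c;
        colored[v] = true;
    }
    return spills;
}
```

A caller would zero-initialize a Graph, set n, add one interference
edge per pair of overlapping live ranges, and call color_graph with
the number of available registers.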

With 16 registers, it sorts 200,000 integers in 8.2 seconds.

regs    spilled live ranges    run-time (seconds)    obj-size (bytes)
----------------------------------------------------------------------
16      3                      8.2                   360
14      5                      8.3                   368
12      8                      8.7                   400
10      13                     10.0                  440
8       17                     13.2                  464

Some caveats: I only tried this one integer program
(as part of another study), maybe our register allocator
isn't the best for just a few registers, and so forth...

On the other hand, our allocator does a good job and lots of FP
intensive programs suggest that we could often effectively use many
more than 16 integer registers.


Finally, let's talk about optimization.  
Generally, optimization competes with register allocation --
optimization lengthens live ranges and increases register pressure.

I think that optimization by the compiler, rather than by the programmer,
is a Good Thing.
It lets the programmer write good code more quickly and compactly.

For example, consider the simple C statements

	for (i=0; i<length; i++)
	    array[i] = brray[i] + crray[i];

Well, we could strength reduce by hand (and save a very important
loop test (sarcasm)), giving

	ap = &array[0];
	bp = &brray[0];
	cp = &crray[0];
	do {
	    *ap = *bp + *cp;
	    ap++;
	    bp++;
	    cp++;
	} while (ap < &array[length]);

But, I'd like to see the optimizer produce

	i = 0;
	if (i < length) {
	    do {
		*(array+i) = *(brray+i) + *(crray+i);
		i++;
	    } while (i < length);
	}

This version saves registers, saves branches, is safe,
and creates opportunities for hoisting loop invariants.
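
The "is safe" part is easy to check by wrapping the three loops above
in functions.  The sketch below assumes int elements and made-up
function names (neither is given in the original fragments).  With
length == 0, the hand-reduced do/while still runs its body once and
writes array[0]; the other two versions touch nothing.

```c
#include <assert.h>

/* 1. The straightforward source loop. */
void add_simple(int *array, const int *brray, const int *crray, int length)
{
    int i;
    for (i = 0; i < length; i++)
        array[i] = brray[i] + crray[i];
}

/* 2. Strength reduced by hand.  The body runs before the first test,
      so this version is only correct when length >= 1. */
void add_by_hand(int *array, const int *brray, const int *crray, int length)
{
    int *ap = &array[0];
    const int *bp = &brray[0];
    const int *cp = &crray[0];
    assert(length >= 0);    /* &array[length] must be a valid address */
    do {
        *ap = *bp + *cp;
        ap++;
        bp++;
        cp++;
    } while (ap < &array[length]);
}

/* 3. What I'd like the optimizer to produce: one index register,
      a guard in front, and the test at the bottom of the loop. */
void add_guarded(int *array, const int *brray, const int *crray, int length)
{
    int i = 0;
    if (i < length) {
        do {
            *(array + i) = *(brray + i) + *(crray + i);
            i++;
        } while (i < length);
    }
}
```

Version 2 also needs three pointer registers plus the loop bound,
where version 3 needs one index and three loop-invariant base
addresses.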

The first example is, I suspect, easier to write and maintain than
the other examples.  It's also easier for the
optimizer to understand.

Finally, the straightforward style
is more portable (if you do all your optimization at the source level,
you must know how many registers are available, ...).

For (a final) example, consider DMXPY, from LINPACK.
The basic computation is:
	do 1 j = 1, n2
	    do 1 i = 1, n1
1		y(i) = y(i) + x(j) * m(i, j)

In LINPACK though,
it's been carefully hand coded to produce nice code
on some machine.  Many loops have been unrolled and
the results are probably fabulous on a Cray.
The main loop looks like
	do 1 j = jmin, n2, 16
	    do 1 i = 1, n1
1		y(i) = ((((((((((((((( (y(i))
		     + x(j-15) * m(i, j-15)) + x(j-14) * m(i, j-14))
		     + x(j-13) * m(i, j-13)) + x(j-12) * m(i, j-12))
		     + x(j-11) * m(i, j-11)) + x(j-10) * m(i, j-10))
		     + x(j- 9) * m(i, j- 9)) + x(j- 8) * m(i, j- 8))
		     + x(j- 7) * m(i, j- 7)) + x(j- 6) * m(i, j- 6))
		     + x(j- 5) * m(i, j- 5)) + x(j- 4) * m(i, j- 4))
		     + x(j- 3) * m(i, j- 3)) + x(j- 2) * m(i, j- 2))
		     + x(j- 1) * m(i, j- 1)) + x(j   ) * m(i, j   )

A fairly complex expression.  The results weren't
very fabulous on my RT.
I count 16 floating point values
that are loop invariant in the i loop.  Tough to handle
with only 8 FP registers.  It would also take a healthy optimizer
to generate a minimal set of addressing expressions for all the
array references.

On the other hand, the basic code is (by comparison) crystal clear.
A fancy, dependence-based optimizer could rework it to run quickly
on an RT, a MIPS, or a Cray, with the optimizations used depending
upon the architecture of the target.
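
As a rough idea of what such a rework might look like for a machine
with about 8 FP registers, here is a C sketch that unrolls the j loop
by 4 (rather than 16) and keeps the four x values in scalars.  The
factor 4, the function names, the 0-based indexing, and the M macro
for column-major storage are all assumptions made for the sketch; it
is neither the LINPACK source nor the output of any actual compiler.

```c
#include <assert.h>

/* Column-major element m(i,j), 0-based, with leading dimension ldm,
   mimicking Fortran storage.  Hypothetical helper for this sketch. */
#define M(i, j) m[(j) * ldm + (i)]

/* The basic computation: y(i) = y(i) + x(j) * m(i,j). */
void dmxpy_basic(int n1, double *y, int n2, int ldm,
                 const double *x, const double *m)
{
    assert(ldm >= n1);
    for (int j = 0; j < n2; j++)
        for (int i = 0; i < n1; i++)
            y[i] = y[i] + x[j] * M(i, j);
}

/* Outer loop unrolled by 4 and jammed into the inner loop.  Only
   four x values are invariant in the i loop, so they plus a few
   temporaries fit in a small FP register file, and y(i) is loaded
   and stored once per four columns instead of once per column. */
void dmxpy_unroll4(int n1, double *y, int n2, int ldm,
                   const double *x, const double *m)
{
    int j = 0;
    assert(ldm >= n1);
    for (; j + 3 < n2; j += 4) {
        const double x0 = x[j],     x1 = x[j + 1];
        const double x2 = x[j + 2], x3 = x[j + 3];
        for (int i = 0; i < n1; i++)
            y[i] = (((y[i] + x0 * M(i, j))
                           + x1 * M(i, j + 1))
                           + x2 * M(i, j + 2))
                           + x3 * M(i, j + 3);
    }
    /* Leftover columns when n2 is not a multiple of 4. */
    for (; j < n2; j++) {
        const double xj = x[j];
        for (int i = 0; i < n1; i++)
            y[i] = y[i] + xj * M(i, j);
    }
}
```

The additions into each y(i) happen in the same order in both
versions.  The point is that the unroll factor is a property of the
target, which is exactly why I'd rather the compiler chose it.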

So, ...
lots of inflammatory material I guess.

	Regards,
	Preston Briggs