Path: utzoo!attcan!uunet!cs.utexas.edu!rice!titan!preston
From: preston@titan.rice.edu (Preston Briggs)
Newsgroups: comp.lang.misc
Subject: Re: function calls
Message-ID: <5991@brazos.Rice.edu>
Date: 22 Mar 90 17:39:10 GMT
References: <29551@amdcad.AMD.COM> <14281@lambda.UUCP> <29585@amdcad.AMD.COM>
Sender: root@rice.edu
Organization: Rice University, Houston
Lines: 120

In article <29585@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <14281@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:

>| question didn't allocate registers very well.  Especially in scientific
>| packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with during
>| execution of even simple routines.  Vectors, arrays, lists, strings, etc,
>| are alle being either produced or consumed.  The fact that none of these

>The compiler that I used to collect the stats will keep all live
>scalars in the register file, but doesn't do inlining of functions,
>and doesn't try to operate on arrays in registers (how do you generate
>code to deal with runtime address calculations???)

Well, you don't keep the whole array in registers, and you don't keep
it there all the time.  There's a paper coming in this summer's
compiler construction conference (Callahan, Carr, and Kennedy) that
shows how to do it.  In the meantime, I can sort of illustrate
the ideas.  We'll use matrix multiply

	DO i = 1, n
	    DO j = 1, n
		A(i, j) = 0
		DO k = 1, n
		    A(i, j) = A(i, j) + B(j, k) * C(k, i)
		END
	    END
	END

Everybody knows the trick of keeping A(i, j) in a scalar (so it ends
up in a register).

	DO i = 1, n
	    DO j = 1, n
		aij = 0
		DO k = 1, n
		    aij = aij + B(j, k) * C(k, i)
		END
		A(i, j) = aij
	    END
	END

So, an easy transformation that keeps a (tiny) part of an array in a register.
We can do lots more.  Let's unroll the j loop (we'll assume n is divisible
by 2.  if it isn't, the transformation is still possible, but extra code
must be inserted).

	DO i = 1, n
	    DO j = 1, n, 2
		aij = 0
		DO k = 1, n
		    aij = aij + B(j, k) * C(k, i)
		END
		A(i, j) = aij
		aij1 = 0
		DO k = 1, n
		    aij1 = aij1 + B(j+1, k) * C(k, i)
		END
		A(i, j+1) = aij1
	    END
	END

Now let's jam the 2 copies of the inner (k) loop together.

	DO i = 1, n
	    DO j = 1, n, 2
		aij = 0
		aij1 = 0
		DO k = 1, n
		    aij  = aij  + B(j,   k) * C(k, i)
		    aij1 = aij1 + B(j+1, k) * C(k, i)
		END
		A(i, j) = aij
		A(i, j+1) = aij1
	    END
	END

and notice the re-use of C(k, i)

	DO i = 1, n
	    DO j = 1, n, 2
		aij = 0
		aij1 = 0
		DO k = 1, n
		    cik = C(i, k)
		    aij  = aij  + B(j,   k) * cik
		    aij1 = aij1 + B(j+1, k) * cik
		END
		A(i, j) = aij
		A(i, j+1) = aij1
	    END
	END

Now we're keeping more pieces of arrays in registers, and our code
is getting faster and faster, especially on machines with fast FP
and (relatively) slow memory.

In the original loop, we had 2n^3 flops, 3n^3 loads, and n^3 stores.
In the last loop, we're down to 2n^3 flops, (3n^3)/2 loads, and n^2 stores.
Notice the number of floating point ops stays the same; we're just reducing
the amount of memory traffic by getting reuse from our registers.

Of course, we could do the same unroll-and-jam trick many more times
on both the i and j loops, and we'll get more and more register reuse
(larger and larger chunks of the arrays kept in registers).
On a machine like the MIPS, we can get a factor of 3 improvement
over your favorite C code.  On the AMD 29000, with its really large
register set, we ought to do even better.  Imagine unrolling-and-jamming
both the i and j loops (say) 6 times each.  We'd need 36 registers
to hold pieces of A, and 6 more to hold a strip of C, and 6 more to hold
a strip of B; but the code would sing/scream.

Of course, you need to do some work in your optimizer.
And you need code with loops and arrays (scientific code).
And it's hard to do for C.

(I still don't like fortran.)

--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu