Path: utzoo!attcan!uunet!cs.utexas.edu!rice!titan!preston From: preston@titan.rice.edu (Preston Briggs) Newsgroups: comp.lang.misc Subject: Re: function calls Message-ID: <5991@brazos.Rice.edu> Date: 22 Mar 90 17:39:10 GMT References: <29551@amdcad.AMD.COM> <14281@lambda.UUCP> <29585@amdcad.AMD.COM> Sender: root@rice.edu Organization: Rice University, Houston Lines: 120 In article <29585@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes: >In article <14281@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes: >| question didn't allocate registers very well. Especially in scientific >| packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with during >| execution of even simple routines. Vectors, arrays, lists, strings, etc, >| are alle being either produced or consumed. The fact that none of these >The compiler that I used to collect the stats will keep all live >scalars in the register file, but doesn't do inlining of functions, >and doesn't try to operate on arrays in registers (how do you generate >code to deal with runtime address calculations???) Well, you don't keep the whole array in registers, and you don't keep it there all the time. There's a paper coming in this summer's compiler construction conference (Callahan, Carr, and Kennedy) that shows how to do it. In the meantime, I can sort of illustrate the ideas. We'll use matrix multiply DO i = 1, n DO j = 1, n A(i, j) = 0 DO k = 1, n A(i, j) = A(i, j) + B(j, k) * C(k, i) END END END Everybody knows the trick of keeping A(i, j) in a scalar (so it ends up in a register). DO i = 1, n DO j = 1, n aij = 0 DO k = 1, n aij = aij + B(j, k) * C(k, i) END A(i, j) = aij END END So, an easy transformation that keeps a (tiny) part of an array in a register. We can do lots more. Let's unroll the j loop (we'll assume n is divisible by 2. if it isn't, the transformation is still possible, but extra code must be inserted). DO i = 1, n DO j = 1, n, 2 aij = 0 DO k = 1, n aij = aij + B(j, k) * C(k, i) END A(i, j) = aij aij1 = 0 DO k = 1, n aij1 = aij1 + B(j+1, k) * C(k, i) END A(i, j+1) = aij1 END END Now let's jam the 2 copies of the inner (k) loop together. DO i = 1, n DO j = 1, n, 2 aij = 0 aij1 = 0 DO k = 1, n aij = aij + B(j, k) * C(k, i) aij1 = aij1 + B(j+1, k) * C(k, i) END A(i, j) = aij A(i, j+1) = aij1 END END and notice the re-use of C(k, i) DO i = 1, n DO j = 1, n, 2 aij = 0 aij1 = 0 DO k = 1, n cik = C(i, k) aij = aij + B(j, k) * cik aij1 = aij1 + B(j+1, k) * cik END A(i, j) = aij A(i, j+1) = aij1 END END Now we're keeping more pieces of arrays in registers, and our code is getting faster and faster, especially on machines with fast FP and (relatively) slow memory. In the original loop, we had 2n^3 flops, 3n^3 loads, and n^3 stores. In the last loop, we're down to 2n^3 flops, (3n^3)/2 loads, and n^2 stores. Notice the number of floating point ops stays the same; we're just reducing the amount of memory traffic by getting reuse from our registers. Of course, we could do the same unroll-and-jam trick many more times on both the i and j loops, and we'll get more and more register reuse (larger and larger chunks of the arrays kept in registers). On a machine like the MIPS, we can get a factor of 3 improvement over your favorite C code. On the AMD 29000, with its really large register set, we ought to do even better. Imagine unrolling-and-jamming both the i and j loops (say) 6 times each. We'd need 36 registers to hold pieces of A, and 6 more to hold a strip of C, and 6 more to hold a strip of B; but the code would sing/scream. Of course, you need to do some work in your optimizer. And you need code with loops and arrays (scientific code). And it's hard to do for C. (I still don't like fortran.) -- Preston Briggs looking for the great leap forward preston@titan.rice.edu