Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!mintaka!spdcc!iecc!compilers-sender From: torek@elf.ee.lbl.gov (Chris Torek) Newsgroups: comp.compilers Subject: Re: SPARC code generation references Keywords: SPARC, optimize Message-ID: <91-05-101@comp.compilers> Date: 26 May 91 21:45:29 GMT References: <91-05-100@comp.compilers> Sender: compilers-sender@iecc.cambridge.ma.us Reply-To: torek@elf.ee.lbl.gov (Chris Torek) Organization: Lawrence Berkeley Laboratory, Berkeley Lines: 236 Approved: compilers@iecc.cambridge.ma.us [Apologies for all the quoting, but I found it difficult to edit this down any further.] In article <91-05-100@comp.compilers> ressler@cs.cornell.edu (Gene Ressler) quotes pardo@cs.washington.edu (David Keppel): >Non-leaf procedures can be optimized if non-leaf behavior is uncommon. That >is, > > int foo(int *a, int *n) { > if (*n == 0) reload(a, BUFSIZE); > --*n; > return (a[*n]); > } > >is compiled by all compilers I know of as: > > save > test > branch > call > norefill: > sub > index > ret > restore > >or some such. Faster on average is: > > test > branch > save // moved > call > restore // moved > norefill: > sub > ret > index > >but harder to determine when it is profitable. Oddly enough, a day or two before this appeared I was talking to John Gilmore and Sean Fagan about this very issue. The answer is that this is profitable more often than might appear at first, and the technique can be (and I claim should be) applied to every function. The general idea is this: Move save and restore instructions `inward', renumbering registers, until they `bump into' call instructions or until the registers do not fit in the free %o set. (Then move the restore outward to fit in a delay slot, if necessary.) If there are no subcalls, or a variable is not live across a subcall, %g temporaries may be used as well to reduce the register pressure. This technique might be called `leaf crushing' (I wanted to call it `leaf burning' but that was a bit too much of a stretch :-) ). The trick is that the input parameters to each function show up in `%o0' through `%o5', which `change over' to `%i0' through `%i5' when a SAVE instruction is done. The change is done inside the CPU by renumbering the registers (changing the register window pointer) and can be done the same way in the compiler. You might think that you would often run into register allocation problems, but the `not live across subcall' condition helps out quite a bit. Let us use the original function as an example. The obvious naive expansion of int foo(int *a, int *n) { if (*n == 0) reload(a, BUFSIZE); --*n; return (a[*n]); } is _foo: save ! get a new window ld [%i1], %o0 ! *n tst %o0 ! 0? bnz L1 ! no, skip call nop ! nothing to do in delay slot mov %i0, %o0 ! first parameter for reload() call _reload ! reload(a, BUFSIZ) mov 1024, %o1 ! second parameter for reload() L1: ld [%i1], %o0 ! *n dec %o0 ! - 1 st %o0, [%i1] ! fix *n ! ld [%i1], %o0 ! (only if terribly stupid) sll %o0, 2, %o0 ! * 4 add %i0, %o0, %o0 ! a + *n ld [%o0], %o0 ! return value ret ! jump back to caller restore ! but first go back to previous window Some fairly simple alias analysis should reduce this to: _foo: save ld [%i1], %o0 ! *n tst %o0 ! 0? bnz L1 ! no, skip call nop mov %i0, %o0 ! as before call _reload mov 1024, %o1 ld [%i1], %o0 ! *n may have changed L1: dec %o0 ! compute *n - 1 st %o0, [%i1] ! update *n sll %o0, 2, %o0 ! as before add %i0, %o0, %o0 ld [%o0], ret restore but this is not all that much better. (The obvious delay-slot fill, copying the `dec %o0' up and using an annulling branch, and moving L1 down one line, can be done later.) Now let us try `leaf-crushing': int foo(int *a, int *n) { if (*n == 0) --------------------------------------- reload(a, BUFSIZE); [[note: *n is unknown]] --------------------------------------- --*n; return (a[*n]); } Here I have marked the last `no subroutine calls' line and the first `end of all subroutine calls' lines. These mark the barriers for `easy' moving of save/restore. I have also noted where the value of *n becomes unknown and must be reloaded (typically, in C at least, compilers assume that anything pointed-to is unknown after any subroutine call, so this is not an unusual case). If we can estimate register needs, we can see that, before and after the call, all we need are the two parameters `a' and `n', and a temporary to hold *n; this easily fits in the leaf set. So now we generate this: _foo: ld [%o1], %o2 ! *n tst %o2 ! 0? bnz L1 ! no, skip call nop save ! not a leaf ---> mov %i0, %o0 ! first parameter for reload() call _reload ! etc. mov 1024, %o1 ---> ld [%i1], %i2 ! *n may have changed restore L1: dec %o2 ! compute *n - 1 st %o2, [%o1] ! update *n sll %o2, 2, %o2 ! as before add %o0, %o2, %o2 ld [%o2], %o0 retl ! leaf return nop The lines marked with arrows ---> show the register renumbering effect. Within the `non-leaf' part of the function, everything that was called `%o' is called `%i' instead. Other temporaries that do not die across the `save' or `restore' boundaries may be kept in %g registers as well; these are not renumbered, but do die across the calls themselves (unlike the %o/%i temporaries---this makes %o/%i temporaries `more valuable'). If we do the usual delay-slot filling on the last example, we get just what is desired: _foo: ld [%o1], %o2 tst %o2 ! if *n != 0 ... bnz,a L1 ! compute *n - 1 and skip call dec %o2 save mov %i0, %o0 call _reload mov 1024, %o1 ld [%i1], %i2 ! update register holding *n restore dec %o2 ! compute *n - 1 L1: st %o2, [%o1] ! update *n sll %o2, 2, %o2 add %o0, %o2, %o2 ! compute &a[n] retl ! leaf return ld [%o2], %o0 ! value = a[n] As long as (6 - n_parameters [%o registers] + 7 [%g registers]) is enough registers, `save' and `restore' instructions can be moved inward at no cost whatsoever---it may not help, but it will not hurt. David Keppel is, however, correct in pointing out that once these registers run out, or if the `save' and `restore' instructions are to be duplicated (see below), the profitability of moving the save and restore inward is murkier. The typical case that does not benefit here, but might from a more sophisticated technique, is code of the form: hard(parms) { if (something) a(); compute; if (somethingelse) b(); } where the `something' and `somethingelse' are rare. It would be possible for a programmer with a leaf-crushing compiler to rewrite this as: uglier(parms) { if (!something && !somethingelse) { compute; return; } if (something) a(); compute; if (somethingelse) b(); } and in fact this sort of duplication is useful if the `something's do not change across an inner loop. Common-code-path-analyzing compilers (such as MIPS', using pixie feedback) might do this automatically anyway; if this is done before leaf-crushing, functions like `hard' above will be properly optimized. -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.