Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!mintaka!spdcc!iecc!compilers-sender
From: torek@elf.ee.lbl.gov (Chris Torek)
Newsgroups: comp.compilers
Subject: Re: SPARC code generation references
Keywords: SPARC, optimize
Message-ID: <91-05-101@comp.compilers>
Date: 26 May 91 21:45:29 GMT
References: <91-05-100@comp.compilers>
Sender: compilers-sender@iecc.cambridge.ma.us
Reply-To: torek@elf.ee.lbl.gov (Chris Torek)
Organization: Lawrence Berkeley Laboratory, Berkeley
Lines: 236
Approved: compilers@iecc.cambridge.ma.us

[Apologies for all the quoting, but I found it difficult to edit this
down any further.]

In article <91-05-100@comp.compilers> ressler@cs.cornell.edu (Gene Ressler)
quotes pardo@cs.washington.edu (David Keppel):
>Non-leaf procedures can be optimized if non-leaf behavior is uncommon.  That
>is,
>
>    int foo(int *a, int *n) {
>      if (*n == 0) reload(a, BUFSIZE);
>      --*n;
>      return (a[*n]);
>    }
>
>is compiled by all compilers I know of as:
>
>    save
>    test
>    branch
>    call
>  norefill:
>    sub
>    index
>    ret
>    restore
>
>or some such.  Faster on average is:
>
>    test
>    branch
>    save	// moved
>    call
>    restore	// moved
>  norefill:
>    sub
>    ret
>    index
>
>but harder to determine when it is profitable.

Oddly enough, a day or two before this appeared I was talking to John Gilmore
and Sean Fagan about this very issue.  The answer is that this is profitable
more often than might appear at first, and the technique can be (and I claim
should be) applied to every function.  The general idea is this:

	Move save and restore instructions `inward', renumbering
	registers, until they `bump into' call instructions or until
	the registers do not fit in the free %o set.  (Then move the
	restore outward to fit in a delay slot, if necessary.)  If
	there are no subcalls, or a variable is not live across a
	subcall, %g temporaries may be used as well to reduce the
	register pressure.

This technique might be called `leaf crushing' (I wanted to call it `leaf
burning' but that was a bit too much of a stretch :-) ).

The trick is that the input parameters to each function show up in `%o0'
through `%o5', which `change over' to `%i0' through `%i5' when a SAVE
instruction is done.  The change is done inside the CPU by renumbering the
registers (changing the register window pointer) and can be done the same way
in the compiler.

You might think that you would often run into register allocation problems,
but the `not live across subcall' condition helps out quite a bit.

Let us use the original function as an example.  The obvious naive expansion
of

	int foo(int *a, int *n) {
		if (*n == 0)
			reload(a, BUFSIZE);
		--*n;
		return (a[*n]);
	}

is

	_foo:
		save			! get a new window
		ld	[%i1], %o0	! *n
		tst	%o0		! 0?
		bnz	L1		! no, skip call
		 nop			! nothing to do in delay slot
		mov	%i0, %o0	! first parameter for reload()
		call	_reload		! reload(a, BUFSIZ)
		 mov	1024, %o1	! second parameter for reload()
	L1:
		ld	[%i1], %o0	! *n
		dec	%o0		! - 1
		st	%o0, [%i1]	! fix *n
	!	ld	[%i1], %o0	! (only if terribly stupid)
		sll	%o0, 2, %o0	! * 4
		add	%i0, %o0, %o0	! a + *n
		ld	[%o0], %o0	! return value
		ret			! jump back to caller
		 restore		! but first go back to previous window

Some fairly simple alias analysis should reduce this to:

	_foo:
		save
		ld	[%i1], %o0	! *n
		tst	%o0		! 0?
		bnz	L1		! no, skip call
		 nop
		mov	%i0, %o0	! as before
		call	_reload
		 mov	1024, %o1
		ld	[%i1], %o0	! *n may have changed
	L1:
		dec	%o0		! compute *n - 1
		st	%o0, [%i1]	! update *n
		sll	%o0, 2, %o0	! as before
		add	%i0, %o0, %o0
		ld	[%o0],
		ret
		 restore

but this is not all that much better.  (The obvious delay-slot fill, copying
the `dec %o0' up and using an annulling branch, and moving L1 down one line,
can be done later.)

Now let us try `leaf-crushing':

	int foo(int *a, int *n) {
		if (*n == 0)
	---------------------------------------
			reload(a, BUFSIZE);
			[[note: *n is unknown]]
	---------------------------------------
		--*n;
		return (a[*n]);
	}

Here I have marked the last `no subroutine calls' line and the first `end of
all subroutine calls' lines.  These mark the barriers for `easy' moving of
save/restore.  I have also noted where the value of *n becomes unknown and
must be reloaded (typically, in C at least, compilers assume that anything
pointed-to is unknown after any subroutine call, so this is not an unusual
case).  If we can estimate register needs, we can see that, before and after
the call, all we need are the two parameters `a' and `n', and a temporary to
hold *n; this easily fits in the leaf set.  So now we generate this:

	_foo:
		ld	[%o1], %o2	! *n
		tst	%o2		! 0?
		bnz	L1		! no, skip call
		 nop
		save			! not a leaf
--->		mov	%i0, %o0	! first parameter for reload()
		call	_reload		! etc.
		 mov	1024, %o1
--->		ld	[%i1], %i2	! *n may have changed
		restore
	L1:
		dec	%o2		! compute *n - 1
		st	%o2, [%o1]	! update *n
		sll	%o2, 2, %o2	! as before
		add	%o0, %o2, %o2
		ld	[%o2], %o0
		retl			! leaf return
		 nop

The lines marked with arrows ---> show the register renumbering effect.
Within the `non-leaf' part of the function, everything that was called `%o'
is called `%i' instead.  Other temporaries that do not die across the `save'
or `restore' boundaries may be kept in %g registers as well; these are not
renumbered, but do die across the calls themselves (unlike the %o/%i
temporaries---this makes %o/%i temporaries `more valuable').

If we do the usual delay-slot filling on the last example, we get just what
is desired:

	_foo:
		ld	[%o1], %o2
		tst	%o2		! if *n != 0 ...
		bnz,a	L1		! compute *n - 1 and skip call
		 dec	%o2
		save
		mov	%i0, %o0
		call	_reload
		 mov	1024, %o1
		ld	[%i1], %i2	! update register holding *n
		restore
		dec	%o2		! compute *n - 1
	L1:
		st	%o2, [%o1]	! update *n
		sll	%o2, 2, %o2
		add	%o0, %o2, %o2	! compute &a[n]
		retl			! leaf return
		 ld	[%o2], %o0	! value = a[n]

As long as (6 - n_parameters [%o registers] + 7 [%g registers]) is enough
registers, `save' and `restore' instructions can be moved inward at no cost
whatsoever---it may not help, but it will not hurt.  David Keppel is,
however, correct in pointing out that once these registers run out, or if the
`save' and `restore' instructions are to be duplicated (see below), the
profitability of moving the save and restore inward is murkier.

The typical case that does not benefit here, but might from a more
sophisticated technique, is code of the form:

	hard(parms) {
		if (something)
			a();
		compute;
		if (somethingelse)
			b();
	}

where the `something' and `somethingelse' are rare.  It would be possible for
a programmer with a leaf-crushing compiler to rewrite this as:

	uglier(parms) {
		if (!something && !somethingelse) {
			compute;
			return;
		}
		if (something)
			a();
		compute;
		if (somethingelse)
			b();
	}

and in fact this sort of duplication is useful if the `something's do
not change across an inner loop.  Common-code-path-analyzing compilers
(such as MIPS', using pixie feedback) might do this automatically
anyway; if this is done before leaf-crushing, functions like `hard'
above will be properly optimized.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.