Path: utzoo!attcan!uunet!zaphod.mps.ohio-state.edu!unix.cis.pitt.edu!dsinc!ub!uhura.cc.rochester.edu!rochester!pt.cs.cmu.edu!sei!firth
From: firth@sei.cmu.edu (Robert Firth)
Newsgroups: comp.arch
Subject: Re: Optimal Computer Architectures
Message-ID: <9507@fy.sei.cmu.edu>
Date: 12 Nov 90 20:55:13 GMT
References: <212412@<1990Nov8> <3300209@m.cs.uiuc.edu> <334@mentat.COM> <1990Nov11.102907.7706@cs.cmu.edu>
Reply-To: firth@sei.cmu.edu (Robert Firth)
Organization: Software Engineering Institute, Pittsburgh, PA
Lines: 49

In article <1990Nov11.102907.7706@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:

>It is my understanding that the SunOS 4.x compiler does
>detect and eliminate tail recursion.  This is verified by
>compiling (-O4 -S)
>
>ENUM(i) { if (i==0) return 1; else return ENUM(i-1); }
>
>and seeing
>
>_ENUM:
>L77001:
>        tst     %o0
>        bne     L77003
>        nop
>        retl
>        add     %g0,1,%o0
>L77003:
>        b       L77001
>        dec     %o0
>
>
>It seems odd that it would generate
>a branch to another branch.  can anyone
>who knows comment?

As a preliminary note: the program mentioned by the original poster
does not lend itself to tail-recursion elimination, so this won't
work for him.  Another option, since the code is fairly short, would
be to hack an assembler version that didn't use the register windows
(as was rightly pointed out, they are optional, not an ineradicable
part of the call protocol).  I actually tried that on Ackermann's
Function (what else?) and got a factor of 7 speedup.

Turning to the above code - yes there does seem to be a defect in
the low-level optimiser here.  There are problems in eliding branch
chains on machines with delay slots, but that doesn't apply above:
the delay slot of the BNE is empty, so one could close the chain and
move the decrement of o0 to the label destination.

However, one can do better.  Since register o0 is dead on the
fall-through (it is destroyed by the add), one can float up both
the instructions at L77003:

	bne L77001
	dec %o0

thus closing the chain and filling the delay slot.  This reduces
the five-instruction loop to three instructions, for a 65% speed up.