Path: utzoo!attcan!uunet!zaphod.mps.ohio-state.edu!unix.cis.pitt.edu!dsinc!ub!uhura.cc.rochester.edu!rochester!pt.cs.cmu.edu!sei!firth From: firth@sei.cmu.edu (Robert Firth) Newsgroups: comp.arch Subject: Re: Optimal Computer Architectures Message-ID: <9507@fy.sei.cmu.edu> Date: 12 Nov 90 20:55:13 GMT References: <212412@<1990Nov8> <3300209@m.cs.uiuc.edu> <334@mentat.COM> <1990Nov11.102907.7706@cs.cmu.edu> Reply-To: firth@sei.cmu.edu (Robert Firth) Organization: Software Engineering Institute, Pittsburgh, PA Lines: 49 In article <1990Nov11.102907.7706@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes: >It is my understanding that the SunOS 4.x compiler does >detect and eliminate tail recursion. This is verified by >compiling (-O4 -S) > >ENUM(i) { if (i==0) return 1; else return ENUM(i-1); } > >and seeing > >_ENUM: >L77001: > tst %o0 > bne L77003 > nop > retl > add %g0,1,%o0 >L77003: > b L77001 > dec %o0 > > >It seems odd that it would generate >a branch to another branch. can anyone >who knows comment? As a preliminary note: the program mentioned by the original poster does not lend itself to tail-recursion elimination, so this won't work for him. Another option, since the code is fairly short, would be to hack an assembler version that didn't use the register windows (as was rightly pointed out, they are optional, not an ineradicable part of the call protocol). I actually tried that on Ackermann's Function (what else?) and got a factor of 7 speedup. Turning to the above code - yes there does seem to be a defect in the low-level optimiser here. There are problems in eliding branch chains on machines with delay slots, but that doesn't apply above: the delay slot of the BNE is empty, so one could close the chain and move the decrement of o0 to the label destination. However, one can do better. Since register o0 is dead on the fall-through (it is destroyed by the add), one can float up both the instructions at L77003: bne L77001 dec %o0 thus closing the chain and filling the delay slot. This reduces the five-instruction loop to three instructions, for a 65% speed up.