Path: utzoo!attcan!uunet!cucstud!tfd!uupsi!rice!bbc
From: bbc@rice.edu (Benjamin Chase)
Newsgroups: comp.benchmarks
Subject: Re: Linpack on SPARCstation 2 vs. SPARCstation 1+ vs. Sun 4/490
Message-ID:
Date: 5 Dec 90 18:00:28 GMT
References: <14274@leadsv.UUCP>
Sender: news@rice.edu (News)
Reply-To: Benjamin Chase
Distribution: na
Organization: Center for Research on Parallel Computations
Lines: 63
In-Reply-To: tn@leadsv.UUCP's message of 4 Dec 90 20:13:10 GMT

>The optimization levels used were
>(0) no optimization; (1) level 1 (-O1); (2) level 2 (-O2); and (3)
>level 3 (-O3).

>Mflops averages
>---------------
>optimization    SPARCstation 2     SPARCstation 1       Sun4/490
>   level        single  double     single  double     single  double
>                   average            average            average
>
>     0           1.7     1.3        1.0     0.7        1.4     1.1
>                     1.5                .85                1.2
>
>     1           1.7     1.3        1.0     0.7        1.4     1.1
>                     1.5                .85                1.2
>
>     2           3.7     2.3        2.1     1.1        2.9     2.0
>                     3.0                1.6                2.4
>
>     3           5.1     3.3        2.7     1.6        4.9     3.1
>                     4.2                2.1                4.0
>
>(All averages are arithmetic means.)

What I found interesting here was the small difference between
optimization level 0 and level 1. Checking my Sun f77 manual page, it
says that the difference between no optimization and -O1 is peephole
optimization.

What sort of peephole optimization are we doing? Just filling those
delay slots? Looking at the code generated from a small C program on
my SPARCstation 1, I see that no-ops are generated for all the delay
slots. On a RISC, there's not much more to do at the peephole level,
if your code generator has half a brain.

Looking further, it seems that "as -O1" doesn't fill the delay slots
of branches either. Very odd.

What sort of peephole optimization is this? The "as" manual page says
that -O[n] "enables peephole optimization corresponding to
optimization level n (1 if n not specified) of the Sun high-level
language compilers". There are different levels of peephole
optimization? Different sizes of peepholes, perhaps?

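To make the "easy marks" concrete, here is a minimal sketch, in Python,
of the kind of delay-slot fill a peephole pass could do. It is not what
Sun's "as" or compilers actually do. The Insn record, its read/write
sets, and the fill_delay_slots function are all hypothetical names made
up for illustration, and the model ignores memory dependences and
annulled branches entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Insn:
    """Toy instruction record (hypothetical, not a real SPARC encoding)."""
    op: str
    reads: frozenset = frozenset()    # registers / condition codes read
    writes: frozenset = frozenset()   # registers / condition codes written
    is_branch: bool = False
    label: Optional[str] = None       # label attached to this instruction


def movable(cand: Insn, branch: Insn, slot: Insn) -> bool:
    """True if cand may be moved from just before branch into its delay slot."""
    if cand.is_branch or cand.op == "nop":
        return False
    # A label on any of the three means another path can enter here,
    # so reordering could change what that path executes.
    if cand.label or branch.label or slot.label:
        return False
    # The branch must not depend on cand, and cand must not see
    # anything the branch itself writes (e.g. a call writing %o7).
    if cand.writes & (branch.reads | branch.writes):
        return False
    if cand.reads & branch.writes:
        return False
    return True


def fill_delay_slots(insns: List[Insn]) -> List[Insn]:
    """Replace 'cand; branch; nop' with 'branch; cand' where it is safe."""
    out: List[Insn] = []
    i = 0
    while i < len(insns):
        cur = insns[i]
        if cur.is_branch and i + 1 < len(insns) and insns[i + 1].op == "nop":
            slot = insns[i + 1]
            prev = out[-1] if out else None
            # Do not steal an instruction that already sits in the
            # delay slot of an earlier branch.
            in_earlier_slot = len(out) >= 2 and out[-2].is_branch
            if prev is not None and not in_earlier_slot and movable(prev, cur, slot):
                out.pop()
                out.extend([cur, prev])
            else:
                out.extend([cur, slot])
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


def demo() -> List[str]:
    code = [
        Insn("cmp", reads=frozenset({"%o0"}), writes=frozenset({"icc"})),
        Insn("add", reads=frozenset({"%o1"}), writes=frozenset({"%o1"})),
        Insn("be", reads=frozenset({"icc"}), is_branch=True),
        Insn("nop"),
    ]
    return [insn.op for insn in fill_delay_slots(code)]


if __name__ == "__main__":
    print(demo())
```

In the demo, the "add" does not touch the condition codes the branch
tests, so it can drop into the slot and the no-op disappears. If the
"cmp" sat directly before the branch instead, the sketch would leave
the no-op alone, which is exactly the hard case a three-instruction
window cannot solve.
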
Perhaps Sun only does full-blown filling of delay slots, through a
large-scale (rather than peephole) analysis of the generated code?
Admittedly, this elephant gun approach is necessary to fill those
hard-to-fill slots (i.e. when you're turning on the "annul" bit of the
branch, inhibiting execution of the instruction in the delay slot when
the branch is not taken). And if you've got the elephant gun approach
working, why let a popgun (i.e. a peephole optimizer) look for the
easy marks?

Looks like I need to teach my cute SPARC disassembler to use symbolic
labels for branch targets, so I can get a meaningful diff between
disassembled versions of each flavor of code, to actually see what the
peephole optimizer is or isn't doing.

I suspect any followup to this post probably needs to go somewhere
other than comp.benchmarks, though I don't know which other group to
pick. I seem to have wandered into the land of instruction scheduling
and SPARC assembly language...
--
Ben Chase, Rice University, Houston, Texas