Path: utzoo!attcan!uunet!cucstud!tfd!uupsi!rice!bbc
From: bbc@rice.edu (Benjamin Chase)
Newsgroups: comp.benchmarks
Subject: Re: Linpack on SPARCstation 2 vs. SPARCstation 1+ vs. Sun 4/490
Message-ID: <BBC.90Dec5130028@libya.rice.edu>
Date: 5 Dec 90 18:00:28 GMT
References: <14274@leadsv.UUCP>
Sender: news@rice.edu (News)
Reply-To: Benjamin Chase <bbc@rice.edu>
Distribution: na
Organization: Center for Research on Parallel Computations
Lines: 63
In-Reply-To: tn@leadsv.UUCP's message of 4 Dec 90 20:13:10 GMT

>The optimization levels used were
>(0) no optimization; (1) level 1 (-O1); (2) level 2 (-O2); and (3)
>level 3 (-O3).

>Mflops averages
>---------------
>optimization    SPARCstation 2      SPARCstation 1        Sun4/490
>   level        single  double      single  double      single  double
>                   average             average             average
>
>     0          1.7     1.3         1.0     0.7         1.4     1.1
>                    1.5                 .85                 1.2
>
>     1          1.7     1.3         1.0     0.7         1.4     1.1
>                    1.5                 .85                 1.2
>
>     2          3.7     2.3         2.1     1.1         2.9     2.0
>                    3.0                 1.6                 2.4
>
>     3          5.1     3.3         2.7     1.6         4.9     3.1
>                    4.2                 2.1                 4.0
>
>(All averages are arithmetic means.)

What I found interesting here was that optimization level 0 and level 1
give identical numbers in this table.  Checking my Sun f77 manual page, it
says that the difference between no optimization and -O1 is peephole
optimization.  What sort of peephole optimization are we doing?  Just
filling those delay slots?

Looking at the code generated from a small C program on my
SPARCstation 1, I see that no-ops are generated for all the delay
slots.  On a RISC, there's not much more to do at the peephole level,
if your code generator has half a brain.
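
To make "filling the easy slots" concrete, here is a toy sketch in
Python.  It is not Sun's optimizer; the Insn class, its fields, and
fill_delay_slots() are names I made up for illustration.  It looks
through a three-instruction window (previous instruction, branch,
no-op) and moves the previous instruction into the slot when the
branch doesn't depend on it.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Insn:
    text: str                       # e.g. "add %o1, 1, %o1"
    reads: frozenset = frozenset()  # registers read ("icc" = condition codes)
    writes: frozenset = frozenset() # registers written
    is_branch: bool = False
    label: str = ""                 # label attached to this instruction, if any

def fill_delay_slots(code):
    """Move the instruction just before a branch into a no-op delay slot."""
    out = list(code)
    i = 1
    while i + 1 < len(out):
        prev, br, slot = out[i - 1], out[i], out[i + 1]
        # prev may itself be sitting in an earlier branch's delay slot
        prev_in_slot = i >= 2 and out[i - 2].is_branch
        # the branch must not need prev's result, nor clobber what prev uses
        independent = (not (prev.writes & br.reads)
                       and not (br.writes & (prev.reads | prev.writes)))
        if (br.is_branch and not br.label
                and slot.text == "nop" and not slot.label
                and not prev.is_branch and prev.text != "nop"
                and not prev_in_slot and independent):
            out[i - 1] = replace(br, label=prev.label)  # label now names the branch
            out[i] = replace(prev, label="")            # prev fills the slot
            del out[i + 1]                              # the no-op goes away
        i += 1
    return out

example = [
    Insn("cmp %o0, 10", reads=frozenset({"%o0"}), writes=frozenset({"icc"})),
    Insn("add %o1, 1, %o1", reads=frozenset({"%o1"}), writes=frozenset({"%o1"})),
    Insn("bl loop", reads=frozenset({"icc"}), is_branch=True),
    Insn("nop"),
]
for insn in fill_delay_slots(example):
    print(insn.text)
```

If the cmp sits directly in front of the branch, this window finds
nothing to move, which is about as far as a popgun can see.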

Looking further, it seems that "as -O1" doesn't fill the delay slots
of branches either.  Very odd.  What sort of peephole optimization is
this?  The "as" manual page says that -O[n] "enables peephole
optimization corresponding to optimization level n (1 if n not
specified) of the Sun high-level language compilers".  There are
different levels of peephole optimization?  Different sizes of
peepholes, perhaps?

Perhaps Sun only does full-blown filling of delay slots, through a
large-scale (rather than peephole) analysis of the generated code?
Admittedly, this elephant gun approach is necessary to fill those
hard-to-fill slots (i.e., when you're turning on the "annul" bit of
the branch, inhibiting execution of the instruction in the delay slot
when the branch is not taken).  And if you've got the elephant gun
approach working, why let a popgun (i.e., a peephole optimizer) look for
the easy marks?
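
For reference, the annul rule as I understand it from the SPARC
architecture manual can be written down as a small truth function.
This is a sketch in Python; the function name and its parameters are
mine, not anything from Sun's tools.

```python
def delay_slot_executes(annul, taken, conditional=True):
    """Does the instruction in a branch's delay slot execute?"""
    if not annul:
        return True    # annul bit clear: the slot always executes
    if conditional:
        return taken   # annul bit set: the slot runs only if the branch is taken
    return False       # "ba,a" / "bn,a": the slot is always annulled

for annul in (False, True):
    for taken in (False, True):
        print(annul, taken, delay_slot_executes(annul, taken))
```

The "annul set, conditional" row is what makes the hard slots
fillable: an instruction copied from the branch target can go in the
slot, since it only runs when control actually goes there.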

Looks like I need to teach my cute SPARC disassembler to use symbolic
labels for branch targets, so I can get a meaningful diff between
disassembled versions of each flavor of code, to actually see what the
peephole optimizer is or isn't doing.
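
Something along these lines is what I have in mind, sketched in Python
rather than in the disassembler itself.  The input format ("0xADDR:
instruction", with branch targets printed as absolute hex addresses)
is an assumption for illustration, as are the names symbolize(), LINE
and BRANCH.  Labels are numbered in order of first use, so two
listings that differ only in load address come out identical.

```python
import re

# Hypothetical disassembly line: "0x1004: bl 0x1000"
LINE = re.compile(r"^(0x[0-9a-f]+):\s+(.*)$")
# Branch or call whose only operand is an absolute address
BRANCH = re.compile(r"^(b\w*(?:,a)?|call)\s+(0x[0-9a-f]+)$")

def symbolize(lines):
    """Drop the address column and replace branch targets with labels."""
    parsed = []
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            parsed.append((m.group(1), m.group(2).strip()))
    # Assign L0, L1, ... to targets in the order they are first referenced
    labels = {}
    for _, text in parsed:
        m = BRANCH.match(text)
        if m and m.group(2) not in labels:
            labels[m.group(2)] = "L%d" % len(labels)
    out = []
    for addr, text in parsed:
        if addr in labels:
            out.append(labels[addr] + ":")
        m = BRANCH.match(text)
        if m:
            text = "%s %s" % (m.group(1), labels[m.group(2)])
        out.append("\t" + text)
    return out

listing = ["0x1000: cmp %o0, 10", "0x1004: bl 0x1000", "0x1008: nop"]
print("\n".join(symbolize(listing)))
```

The two symbolized listings can then be fed to diff (or Python's
difflib.unified_diff) without every shifted address showing up as a
change.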

I suspect any followup to this post probably needs to go somewhere
other than comp.benchmarks, though I don't know which other group to
pick.  I seem to have wandered into the land of instruction
scheduling and SPARC assembly language...
--
	Ben Chase <bbc@rice.edu>, Rice University, Houston, Texas