Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!apple!mips!winchester!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Integer/Multiply/Divide on Sparc
Message-ID: <34110@mips.mips.COM>
Date: 4 Jan 90 06:00:37 GMT
References: <158@csinc.UUCP> <787@stat.fsu.edu> <42701@lll-winken.LLNL.GOV> <788@stat.fsu.edu> <42737@lll-winken.LLNL.GOV> <KHB.90Jan2121328@chiba.kbierman@sun.com> <5842@ncar.ucar.edu> <34058@mips.mips.COM> <28594@amdcad.AMD.COM>
Sender: news@mips.COM
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Inc.
Lines: 97

In article <28594@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <34058@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>| 	Could somebody post the critical parts of this again so we can
>| 	look at it?  Although I have high respect for Plum-Hall in general,
>| 	I'm always nervous about micro-level benchmarks.  Now, I hate to have
>| 	to defend SPARC :-), but I must: realistic integer benchmarks
>| 	that I know [like the SPEC ones] simply don't correlate with
>| 	the results claimed below, at least not very much.
>| 	The RISC machines are noticeably faster on actual integer programs....

>The benchmarks over-emphasize integer modulus.  For example, the
>benchmark that reportedly tests register-integer variables looks like:
......
>and spends roughly 75% of its time performing the "%" operation.

Like I said, I'm always nervous about micro-level benchmarks, even when done
by smart people.  Here is the summary, followed by details:

SUMMARY OF MY ADVICE:
	1) do NOT EVER believe that this benchmark means anything;
	if you have a copy, throw it away.
	2) FORGET any conclusions that anyone has posted about relative
	performance of machines, based on this benchmark, other than
	the possible thought that integer multiply/divide don't happen
	to be done in hardware on SPARC and HP PA.
	3) If you've ever told anyone this means much, please tell them
	you're sorry.
DETAILS:
Bo Thide kindly sent me a copy, and I took a quick look, finding similar
results to Tim's:
	optimized R3000 code spent 60% of the time doing % (and remember, we
		have one of those in hardware...)

Tables were given that looked like this:
                     register  auto      auto       int    function    auto
                       int     short     long    multiply  call+ret    double
-------------------------------------------------------------------------------
cc:
Sun-4     (-O4)      0.41      0.44      0.40      4.45      0.67      1.00    
......

WHAT'S RIGHT:
1) The code is very carefully done to eliminate surprise optimization.

WHAT'S WRONG:
1,2,3) Columns 1, 2, and 3, which purport to measure the performance of various
integer code, are completely dominated by the modulus operation, which is
simply contrary to the statistics of the overpowering bulk of code out there.
It would be plausible to generate something that had a mix of +, -, *, /, %
and logic ops, using carefully chosen frequencies from a number of real
programs (and even there, there are pitfalls), but something that does no *
or /, and uses % way out of proportion, is guaranteed to blast a SPARC about
as badly as it can be, relative to almost anything else.  It won't help
HP PAs much either....
Also, for column 1, optimized, I got 0% loads and 5% stores (as fractions of
instructions executed), rather than the more typical 20% and 10%.

4) This column indeed measures the speed of integer multiply, in such a way
that no compiler can do anything but do real multiplies with it.

5) This column measures the speed of function call/return with zero arguments.
Unfortunately, different programs have different distributions of numbers and
types of arguments, and many functions have arguments.  Different machines
differ greatly in the cost of passing arguments, and in the costs of passing
different numbers of arguments....

6) I haven't looked at the statistics of this much, except to notice there are
equal numbers of FP * and /, which is also atypical.
----
7) In general, although it's been said before in this newsgroup:
	a) People design computers using the statistics of real programs.
	b) The statistics of real programs differ, hence the tradeoffs you
	make depend on the benchmarks chosen.
	c) There are certain classes of codes for which at least one of
	integer *, /, or % is important enough, and cannot be gotten rid
	of even by a perfect compiler, that having these in hardware
	helps a lot.  Over many realistic programs, hardware helps
	just about enough that some people chose to include it, and some didn't.
	There's no way in the world that having it makes a 2-3X performance
	difference overall, although you can find some real programs where it
	does.
	d) Like many synthetic benchmarks, it simply doesn't have a mixture
	of expressions that resembles what real compilers see, i.e., there is
	little that an optimizing compiler can do with this code, and a small
	number of registers is completely adequate.  Neither of these
	is generally true for real code.

Anyway, these benchmarks mostly measure integer multiply and divide;
these operations are where most RISCs have the least advantage over
most CISCs; these operations are definitely what anybody would use to
show that some CISC is faster than SPARC or PA; but they just don't
correlate very well with the speeds on real programs.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086