Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!zaphod.mps.ohio-state.edu!brutus.cs.uiuc.edu!apple!vsi1!wyse!mips!winchester!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: SPARC vs MC68040
Message-ID: <35846@mips.mips.COM>
Date: 12 Feb 90 18:43:05 GMT
References: <8859@portia.Stanford.EDU> <5190@convex.convex.com> <1850@cbnewsi.ATT.COM> <2938@oakhill.UUCP> <3085@rtmvax.UUCP> <35825@mips.mips.COM> <2943@oakhill.UUCP>
Sender: news@mips.COM
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Inc.
Lines: 135

In article <2943@oakhill.UUCP> davet@oakhill.UUCP (David Trissel) writes:
>In article <35825@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>
>>>> is 23,148 KDhrys. The MC68040 runs that benchmark at twice the speed.
>
>>A bunch of people at various companies are busily stuffing SPEC numbers into
>>spreadsheets, plus published mips-ratings, and analyzing.  I'm also trying
>>to calibrate i486 and 68040 numbers into this scheme.
>
>What does this have to do with Dhrystone?
Sorry, among other things, when you start looking at such data, you see that:
	a) Dhrystone correlates with integer performance on real benchmarks
	within machine lines, with the same compilers, at least somewhat.
	b) It has some correlation across machine lines.
	c) If one machine uses the inlining, and one doesn't, the difference
	in Dhrystone numbers badly mispredicts the difference on realistic
	programs.
>The Dhrystone benchmarks have known weaknesses. The SPEC benchmarks have their 
>own. Many people are interested in Dhrystone so it gets talked about. If you
>don't care for discussions on Dhrystone then simply ignore them.
Impossible: it causes too much confusion, and I have to keep explaining it to
financial analysts, and I'm tired of that.  The SPEC benchmarks have their
own weaknesses of course, but they're hardly in Dhrystone's class.

>>"In any case, for serious performance evaluation, users are  advised  to
>>ask  for  code listings and to check them carefully."

>This is true for ALL benchmarks. Do you think it only applies to Dhrystone?
>Do you think it does not apply to the SPEC benchmarks? I know I wouldn't
>be choosing a computer architecture without looking at code the compiler
>produces.
Of course, but in Dhrystone's case, if all you have is the numbers for two
machines, you know very little about their relative performance without
looking at the code; it is especially irksome that it allows an optimization
that improves its performance greatly but simply does not improve
realistic programs significantly. (That doesn't mean that selective inlining
of strings is bad; in fact, if Dhrystone contained a REPRESENTATIVE set of
string operations, I wouldn't object so much, but it doesn't.)

>
>>	EVERYBODY knows that inlining strcpy&strcmp can boost the number
>>	strongly without giving anything like that boost on real programs.
>>	SO POST THE CODE WHERE THE CRUCIAL STRCPY/STRCMP calls are made;
>>	otherwise, the number is simply meaningless, because anybody can
>>	boost the performance substantially on Dhrystone by an optimization
>>	that has relatively little effect on real programs.
>
>I fail to understand your tone here. By your own admission in a posting
>you did to this newsgroup on March 15, 1989:
>
> "Now, according to the letter or the law of Herr Doktor Weicker's 
>  Dhrystone 2.1 writeup, it's OK to in-line strcpy and strcmp.

Yes, but subject to the comment above, which most people will not do,
i.e., hardly anyone shows the code for this.  The SPEC benchmarks were chosen
to allow any optimization you like, but have the effect that there are very
few optimizations you can do that won't help lots of real programs.
>
>and this is what the MC68040 compiler does. So just what is the problem? 
>Here is one of the string copies (they all look similar) directly from the 
>benchmark's .s file:
>
>    lea.l   (12,%sp),%a5
>    mov.l   %a5,%a1
>    mov.l   &L%93,%a0
>    mov.l   (%a0)+,(%a1)+
>    mov.l   (%a0)+,(%a1)+
>    mov.l   (%a0)+,(%a1)+
>    mov.l   (%a0)+,(%a1)+
>    mov.l   (%a0)+,(%a1)+
>    mov.l   (%a0)+,(%a1)+
>    mov.l   (%a0)+,(%a1)+
>    mov.w   (%a0)+,(%a1)+
>    mov.b   (%a0)+,(%a1)+
	Good! You at least did it correctly for the general case, unlike the i860,
	which pads this to 32 bytes so it can do 2 quad-word loads & stores...
>
>Now let's see you post the code that your MIPS compiler produces. Then tell
>us what you find to be relevant about the two postings.
Dhrystone usually overpredicts VAX-relative performance; on most machines,
if I know that this inlining is being done, I can estimate that it overpredicts
it another 20-30%.  That's what's relevant.
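
As a rough illustration of that arithmetic (a sketch, not an estimate from
this posting): if "overpredicts by 20-30%" is read as "the inlined number is
1.2 to 1.3 times what the non-inlined number would be", then a claimed ratio
can be discounted as below. The function name and that reading of the
percentage are assumptions made here for illustration.

```c
/* Hypothetical helper: discount a Dhrystone-based speed ratio when only
 * the faster-looking machine inlines strcpy/strcmp.
 * ASSUMPTION: "overpredicts by p" means inflated = true * (1 + p). */
double discount_ratio(double claimed_ratio, double overprediction)
{
    return claimed_ratio / (1.0 + overprediction);
}
```

Under that reading, a claimed 2X becomes discount_ratio(2.0, 0.20), about
1.67X, or discount_ratio(2.0, 0.30), about 1.54X, before any of Dhrystone's
other overprediction is taken into account.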

The numbers we use all come from:
	jal	strcpy
and I've seen the equivalent call in the SPARC code; that means, I think,
that such numbers don't overpredict as much (they still overpredict, and this
has been well documented for years in published materials).

And the reasons we don't inline str* are:
	a) When you inline code it gets bigger.
	b) You might want to inline it only in those places it's called a lot.
	c) But there's a complicated set of rules for when it's really a good
	idea in general.  Among other things, MOST strcpy's aren't of constants,
	they're of pointers to things whose alignment can't be predicted,
	or at least the target is some arbitrary pointer, and then this
	optimization doesn't work very well.  The only one I've seen that looked
	like it would really pay off is inlining strcpy's of small constants
	(1-2 bytes), or ones where you happen to know the alignment, and then
	up to a few words.
	d) Remember, we actually do full-bore inlining in the general case....
	but are forbidden by the rules from using it.... and we don't.
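
The distinction in (c) can be sketched in C. This is only an illustration:
the function names are hypothetical, and the constant is assumed to be the
first Dhrystone 2.1 string, whose 31 bytes (30 characters plus the NUL) match
the 7 longword, 1 word and 1 byte moves in the quoted 68040 listing.

```c
#include <string.h>

/* ASSUMPTION: the Dhrystone 2.1 constant; sizeof gives 31 with the NUL. */
const char dhry_str[] = "DHRYSTONE PROGRAM, 1'ST STRING";

/* Constant source, known length: a compiler can turn this fixed-size
 * copy into straight-line moves with no loop and no test for NUL. */
void copy_const_inline(char *dst)
{
    memcpy(dst, dhry_str, sizeof dhry_str);
}

/* Arbitrary pointer: length and alignment are unknown at compile time,
 * so the copy must find the NUL at run time. This is the
 * "jal strcpy" case. */
void copy_general(char *dst, const char *src)
{
    strcpy(dst, src);
}
```

Both routines leave the same bytes in dst; the difference is only in how
much the compiler knows in advance, which is why the first form pays off on
Dhrystone and much less often on programs that copy through arbitrary
pointers.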

Here's the bottom line: either Dhrystone is a good predictor of
integer performance on real programs, or it isn't.  If it is (and it once
almost used to be), then it's a Good Thing, because it's simple and easy
to use.  If it doesn't correlate well with performance on real programs,
then it's become obsolete.

Rather than replowing ground that has been plowed for years,
let's try something else, as a bottom line, and get something concrete:
QUIZ:

It is claimed that a 25MHz 68040 is 2X faster than a 25MHz SPARC on Dhrystone;
for concreteness, consider a 68040 with at least 64K external cache.

a) Will it be 2X faster on the Geometric Mean of the 4 SPEC C benchmarks?
(Using same compiler as Dhrystone.)
b) Will it be more than 2X?
c) Will it be less than 2X?
d) Will it be a lot less than 2X, in fact, maybe closer to 1X?

I'd encourage anyone who posts to post an Analysis to back up their opinion,
with some data; I'm working on a Guesstimate for about a week from now.

If someone prefers other realistic benchmarks, that would be a good exercise
as well.

In any case, thanx to Mr. Trissel for properly qualifying the Dhrystone
number; this actually helps a lot.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086