Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!ucsd!sdd.hp.com!zaphod.mps.ohio-state.edu!mips!winchester!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.benchmarks
Subject: Re: SPEC vs. Dhrystone
Message-ID: <44465@mips.mips.COM>
Date: 3 Jan 91 06:18:59 GMT
References: <44342@mips.mips.COM> <15379@ogicse.ogi.edu> <44353@mips.mips.COM> <1685@marlin.NOSC.MIL> <15546@ogicse.ogi.edu>
Sender: news@mips.COM
Reply-To: mash@mips.COM (John Mashey)
Distribution: comp.benchmarks
Organization: MIPS Computer Systems, Inc.
Lines: 140

In article <15546@ogicse.ogi.edu> borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
>In article <1685@marlin.NOSC.MIL> aburto@marlin.nosc.mil.UUCP (Alfred A. Aburto) writes:
>>In article <44353@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>>(Note, for example, that published Dhrystone results easily mis-predict
>>>SPEC integer benchmarks pretty badly, i.e., it is quite easy for machine
>>>"a" to be 25% faster on Dhrystone than "b", and end up 25% SLOWER on more
>>>realistic integer benchmarks.)
>>This is an interesting observation (result).
>>Dhrystone was intended to be REPRESENTATIVE of TYPICAL integer
>>programs. That is, hundreds (I believe) of programs were
>>analyzed to come up with the (ahem) 'typical' high level
>>language instructions and their frequency of usage. In view of this
>>I would, at first sight, suspect the Dhrystone to be more accurate
>>than SPEC as SPEC is based upon only a few integer programs.
>I suspect there are two factors at work here. First, Dhrystone is a
>fairly small benchmark, and would not exercise the memory hierarchy
>as hard as the real programs in SPEC. The second factor is that it is
>easier to tune a compiler to a small benchmark like Dhrystone (or for
>that matter Whetstone and the Livermore Loops) than it is to tune for
>a variety of different real programs. By the way, I believe Dhrystone
>was originally written in ADA and was translated to "C", the form in
>which it is usually run.
>>Real programs also show a great variation in performance. I noticed

Well, there are multiple issues, of which several have been mentioned
already here. Let us review the history of Dhrystone, which was
originally, as stated, a reasonable attempt to model ADA usage, and
then got converted into C.

ISSUE 1: building a small synthetic benchmark that is TYPICAL of
intended usage is
	1a: extremely difficult to do, even for a current set of
	    hardware and software
	1b: REALLY hard to do and expect to remain valid over time,
	    even in the absence of 1c
	1c: really hard to do, if small enough to be subject to
	    compiler gimmickry

Let us take each of these in turn:

1a: is hard, because the usual (and reasonable) methodology is:
	1: select attributes that should be measured
	2: gather statistics from a set of programs
	3: build the benchmark to model those attributes
The problem is: you may not choose the "right" attributes, and in
fact, there is no small set of right attributes; there are only better
and better approximations, even to modeling a single user program (not
even a large mix). For example, suppose your first approximation is:
count the number of +, -, *, and / executed.
	Not a very good approximation, so add the number of if
	statements.
	Still not good, so add the distribution of sizes of
	expressions.
	Still not good, so add the number of function calls.
	Still not good, so add the distribution of the number of
	arguments of function calls (makes a bigger difference amongst
	machines that pass some arguments in registers).
	Still not good: some architectures (like SPARC) can be
	sensitive to the depth of function calls, so do something
	about that.
	Still not good: haven't done anything about array indexing,
	and different architectures react differently.
	Still not good: haven't done anything about pointers, so add
	some pointer references.
	Still not good: you haven't measured the frequency of
	different-sized offsets from pointers (and surprise!
	in some architectures there is no difference between a zero
	offset and a non-zero offset; in others (such as the AMD 29K),
	a zero offset is cheaper than a non-zero one; in others, the
	presence of particular addressing modes helps some
	combinations much more than others).
	Still not good: how often is the same pointer->object
	referenced close enough in the code, and under such
	conditions, that the compiler can just leave it in a register?
	Still not good: is the distribution of variable references
	such that the benchmark will model the effects of a good
	register allocator, or not?
	..... and on, and on ....
I.e., it is VERY easy to do a competent job of feature extraction and
modeling, and still get surprised, where "surprise" = the synthetic
benchmark doesn't correlate well with realistic code of the class that
it was supposed to model. (I've looked at many synthetic benchmarks
with our tools; the numbers quite often don't look anything like what
you see when you analyze real programs.)

1b: hard to do over time:
If asked to compare machines that basically differ only by the clock
rate (same CPU, same compilers), a small benchmark is adequate.
However, hardware tends to get more complex over time; in particular,
faster machines use caches, caches get bigger, multi-level caches
appear, etc. Programs expand to use these; if a benchmark doesn't also
expand appropriately, it starts to measure only the smallest part of
the memory hierarchy. In addition, optimizing compilers get better,
and they optimize away pieces of the code, especially in a small
synthetic benchmark.

1c: compiler gimmickry:
For any important benchmark that is small, compilers will get tuned in
ways that are absolutely useless in real life. This has happened at
least with Whetstone, Dhrystone, and LINPACK.

ISSUE 2: Dhrystone in particular

The MIPS Performance Brief, Issue 3.9 (and earlier) has had analyses
of Dhrystone issues, for years. Here is a brief summary:

1. Small, will fit in tiny instruction and data caches.

2. References and re-references data in ways that model the effects of
write-back & write-thru caches poorly.

3. Subroutine calls are of shallow depth, hence it never
underflows/overflows on a register window/stack cache machine.

4. Makes function calls more frequently than any real program I've
ever seen, i.e., on a MIPS, uses <40 cycles per call, whereas 60-100
is much more typical of C programs.

5. Can easily spend 30% of its time in strcpy, unlike any real program
I've ever analyzed. Due to the particular use (copy a 30-byte
constant, over and over again), it is especially amenable to
gimmickry, such as compiler options which generate incorrect code for
real use, but happen to work for Dhrystone. (Note that some of this is
an artifact of translation from ADA/Pascal (fixed-length strings) to
C.) Most amusing code: i860, where the 30-byte constant is expanded to
32, and is then copied with 2 16-byte loads, followed by 2 16-byte
stores; not very typical of real C-language string processing, where
most pointers are to variables whose sizes are unknown at compile
time....

6. There is an unusually high frequency of zero-offset pointers.

7. In the earlier versions, there was obvious dead code, which started
to disappear under the pressure of better optimizers (not gimmickry,
just better compilers).

8. Also, the earlier versions never worried about compilers that can
merge the whole program together and inline EVERYTHING...

So, Reinhold W. started with something that was actually a reasonable
attempt, but it is HARD TO DO, and even HARDER to keep sensible...
-- 
-john mashey	DISCLAIMER:
UUCP: 	mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086