Path: utzoo!yunexus!geac!syntron!jtsv16!uunet!husc6!bloom-beacon!think!ames!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Benchmarking
Message-ID: <5356@winchester.mips.COM>
Date: 11 Oct 88 17:15:27 GMT
Article-I.D.: winchest.5356
References: <2220003@hpausla.HP.COM> <46500026@uxe.cso.uiuc.edu> <1988Oct9.011633.13259@utzoo.uucp> <6001@june.cs.washington.edu>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 71

In article <6001@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>Many postings in this stream seem to assume that "large, real" programs are
>somehow the most fair to use for benchmarking.  That's not necessarily true.
As we've said numerous times, the best benchmark for anybody is for them
to run their own real applications, because such applications obviously
have the highest correlation with what they'll see in real use.
When I keep saying "use large, real programs", it's because I usually
have in front of me numerous statistics about the behavior of programs
that show that most of the toy benchmarks aren't very good predictors of
the real applications, especially when applied to the higher-performance
designs.  Why is this?
	a) Toys don't stress cache designs, so that small caches and large
	ones act about the same, which is simply untrue for many real programs.
	(In this case, "cache" includes any place in the memory hierarchy,
	including registers, stack caches, register windows, 1-to-n levels
	of memory caches, disk caches in main memory, etc.)
	b) Toys don't stress limits.  For example, consider the performance
	differences attributable to the different X86 memory models.
	c) Toys don't stress software.  Anybody can compile Dhrystone or
	Whetstone, and many can optimize them.  Compiling/optimizing Spice
	tells you a lot more.
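
To make point a) concrete, here is a minimal sketch (mine, for illustration
only, not a benchmark anybody should publish) that times strided reads over
buffers of increasing size.  The function names, the 64-byte STRIDE and the
buffer sizes are all assumptions; clock() is coarse, the modulo adds loop
overhead, and prefetching hardware can flatten the curve.  The idea is only
that a toy whose data fits in the smallest buffer never sees the later rows.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE 64   /* assumed line size in bytes; adjust per machine */

static volatile unsigned sink;  /* keeps the checksum observable */

/* Average time per read, in ns, over a buffer of `bytes` bytes.
   Returns -1.0 on bad arguments or allocation failure. */
double ns_per_access(size_t bytes, long touches)
{
    unsigned char *buf;
    const volatile unsigned char *p;
    size_t i;
    long n;
    unsigned sum = 0;
    clock_t t0, t1;

    if (bytes == 0 || touches <= 0)
        return -1.0;
    buf = malloc(bytes);
    if (buf == NULL)
        return -1.0;
    for (i = 0; i < bytes; i++)     /* touch every byte once up front */
        buf[i] = (unsigned char)i;

    p = buf;                        /* volatile reads cannot be elided */
    i = 0;
    t0 = clock();
    for (n = 0; n < touches; n++) {
        sum += p[i];
        i = (i + STRIDE) % bytes;   /* wrap around the working set */
    }
    t1 = clock();

    sink = sum;
    free(buf);
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)touches;
}

/* Print one row per working-set size, 1 KB up to 4 MB. */
void sweep(void)
{
    size_t kb;

    for (kb = 1; kb <= 4096; kb *= 4)
        printf("%6lu KB  %8.2f ns/access\n",
               (unsigned long)kb, ns_per_access(kb * 1024, 5000000L));
}
```

Call sweep() from main() and compare rows: on a machine with a small cache
the time per access should climb once the buffer outgrows it, which is
exactly the behavior a small benchmark cannot show you.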

>Any program that has had all or most of its development on a single system
>has undoubtedly been tuned for best performance ON THAT SYSTEM.  Look at the
>series of postings on "Duff's device" (an unrolled loop) -- systems without
>instruction caches (or with large ones :-) tend to produce programs that use
>Duff's device, those with small caches encourage using tight loops instead.
>If somebody's compiler doesn't do induction on array index expressions, they
>tend to write critical loops using pointers.  Etc, etc.  I'd guess that an
>awful lot of Unix programs have been tuned to whatever it is that pcc does
>or doesn't do.  The point is, large real programs tend to have long
>histories that bias them in favor of old compiler technology and
>architectures.
Most application software doesn't worry about this kind of thing very much:
the 3rd-party folks worry most about making things work across lots of
machines.
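
For anyone who missed the Duff's device thread, the two loop styles being
contrasted look roughly like the sketch below.  This is the commonly-shown
memory-to-memory variant, not Duff's original (which wrote to a single
output register); the function names are mine.  Which one wins depends on
the compiler and the instruction cache, which is the sort of thing most
application writers never tune for.

```c
#include <stddef.h>

/* Tight loop: small code, one test and branch per element. */
void copy_tight(short *to, const short *from, size_t count)
{
    while (count-- > 0)
        *to++ = *from++;
}

/* Duff's-device style: unrolled 8 times; the switch jumps into the
   middle of the loop body to handle the count % 8 leftover elements. */
void copy_duff(short *to, const short *from, size_t count)
{
    size_t n;

    if (count == 0)
        return;
    n = (count + 7) / 8;            /* number of trips through the loop */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```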
>
>Another problem with large real programs is that it's often very difficult
>to tell what the benchmark results mean.  Does nroff run fast on system Q
>because Q does stream I/O especially well, or because Q is really good at
>optimizing some 10-line inner loop that shoves around characters?  If I
>can't read the code or tell where it's spending its time, how can I possibly
>relate a benchmark result to some different program or application?
>Personally, I get a lot more insight out of a few hundred lines of good test
>cases that I can understand in detail.  
This is certainly true, although good measurement tools help you figure out
where the time is going.  Of course, if you have small benchmarks that
give you good correlation with what you actually use, then you're OK,
and there's nothing wrong with using them, i.e., by definition, you're
using something correlated with your real applications.  One of the points
we've tried to make is that one must be very careful when using simple
benchmarks to predict the performance across wider ranges of architecture
and software.  For example, simple benchmarks used to analyze PC-class
machines don't necessarily work very well for larger ones.  (For PC-class
machines, you can probably get a first-order prediction by knowing
clock rate, CPU type, and memory latency.)
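
As an illustration of what such a first-order prediction might look like,
here is a sketch under my own simplifying assumptions: CPU type is boiled
down to a base cycles-per-instruction figure, every memory reference pays
the full memory latency (i.e., no cache), and nothing overlaps.  The
function name and all parameter names are hypothetical.  The more of those
assumptions a machine violates, the worse this sort of estimate gets, which
is why it does not carry over to larger machines.

```c
/* Crude run-time estimate in seconds for a cacheless, PC-class machine.
   base_cpi stands in for "CPU type"; mem_refs_per_instr is the average
   number of memory references per instruction. */
double est_seconds(double instructions, double clock_hz, double base_cpi,
                   double mem_refs_per_instr, double mem_latency_ns)
{
    /* memory latency expressed in CPU cycles */
    double stall_cycles = mem_latency_ns * 1e-9 * clock_hz;
    /* effective cycles per instruction, assuming no overlap */
    double cpi = base_cpi + mem_refs_per_instr * stall_cycles;

    return instructions * cpi / clock_hz;
}
```

For example, est_seconds(1e7, 10e6, 4.0, 0.5, 200.0) models 10 million
instructions on a 10 MHz part with 200 ns memory and comes out at about
5 seconds.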
>
>Now, I'm all in favor of benchmarking large real programs, particularly the
>ones that *I* like to run.  They also make a very nice sanity check to guard
>against silly benchmark deficiencies like do-nothing loops and results that
>can be determined at compile time.  But if cost constraints make me pick one
>or the other, I'll take the suite of synthetic tests any day.

It is, of course, a goal for many people in this field to create small synthetic
benchmarks that accurately predict the behavior on large real applications,
and this is a very desirable goal.  It's merely hard! 
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086