Path: utzoo!attcan!uunet!seismo!sundc!pitstop!sun!amdcad!ames!mailrus!cornell!uw-beaver!uw-june!rik
From: rik@june.cs.washington.edu (Rik Littlefield)
Newsgroups: comp.arch
Subject: Re: Benchmarking
Summary: There are problems with "large real" programs, too.
Message-ID: <6001@june.cs.washington.edu>
Date: 9 Oct 88 16:48:27 GMT
References: <2220003@hpausla.HP.COM> <46500026@uxe.cso.uiuc.edu> <1988Oct9.011633.13259@utzoo.uucp>
Organization: U of Washington, Computer Science, Seattle
Lines: 30

Many postings in this thread seem to assume that "large, real" programs are
somehow the most fair to use for benchmarking.  That's not necessarily true.
Any program that has had all or most of its development on a single system
has undoubtedly been tuned for best performance ON THAT SYSTEM.  Look at the
series of postings on "Duff's device" (an unrolled loop) -- systems without
instruction caches (or with large ones :-) tend to produce programs that use
Duff's device; those with small caches encourage using tight loops instead.
If somebody's compiler doesn't do induction-variable optimization on array
index expressions, they tend to write critical loops using pointers.  Etc,
etc.  I'd guess that an
awful lot of Unix programs have been tuned to whatever it is that pcc does
or doesn't do.  The point is, large real programs tend to have long
histories that bias them in favor of old compiler technology and
architectures.
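
To make the two tuning habits above concrete, here is a minimal C sketch.
It is an illustration only, not code from any of the postings referred to:
the function names are made up, and copy_duff is the memory-to-memory
variant of Duff's device rather than the original, which wrote to a single
device register.  Which version of each pair runs faster depends on the
machine and compiler, which is exactly the bias being described.

```c
#include <assert.h>
#include <stddef.h>

/* Tight loop: small code footprint, one test and branch per byte. */
void copy_tight(char *to, const char *from, size_t count)
{
    while (count-- > 0)
        *to++ = *from++;
}

/* Duff's device: the copy is unrolled 8 times, and the switch jumps
 * into the middle of the loop body to handle the count % 8 leftovers.
 * Requires count > 0. */
void copy_duff(char *to, const char *from, size_t count)
{
    assert(count > 0);
    size_t n = (count + 7) / 8;      /* number of passes through the loop */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}

/* Indexed form: relies on the compiler to turn a[i] into a stepping
 * pointer if that is cheaper on the target. */
long sum_indexed(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer form: the programmer does that transformation by hand. */
long sum_pointer(const int *a, size_t n)
{
    long s = 0;
    for (const int *p = a, *end = a + n; p < end; p++)
        s += *p;
    return s;
}
```

Each pair computes the same result; only the shape of the loop differs.
A program full of one style or the other carries the fingerprints of the
system it was tuned on.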

Another problem with large real programs is that it's often very difficult
to tell what the benchmark results mean.  Does nroff run fast on system Q
because Q does stream I/O especially well, or because Q is really good at
optimizing some 10-line inner loop that shoves around characters?  If I
can't read the code or tell where it's spending its time, how can I possibly
relate a benchmark result to some different program or application?
Personally, I get a lot more insight out of a few hundred lines of good test
cases that I can understand in detail.  
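
As a rough illustration of a test case small enough to understand in
detail, the C sketch below times one character-shoving loop by itself, with
no I/O involved.  Everything in it is hypothetical: shove_chars is a
stand-in invented for this example, not anything taken from nroff, and
clock() measures CPU time only at whatever resolution the system provides.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical inner loop: copy n characters from src to dst, turning
 * tabs into spaces.  Returns the number of tabs replaced. */
size_t shove_chars(char *dst, const char *src, size_t n)
{
    size_t tabs = 0;
    for (size_t i = 0; i < n; i++) {
        char c = src[i];
        if (c == '\t') {
            c = ' ';
            tabs++;
        }
        dst[i] = c;
    }
    return tabs;
}

/* Run the loop reps times on an in-memory buffer and return the CPU
 * seconds used.  The total tab count is handed back through tabs_out
 * so the work has a visible result. */
double time_shove(char *dst, const char *src, size_t n, int reps,
                  size_t *tabs_out)
{
    size_t tabs = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        tabs += shove_chars(dst, src, n);
    clock_t end = clock();
    *tabs_out = tabs;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Because the buffer is already in memory, a fast or slow result here says
something about the loop and nothing about stream I/O, so the two
explanations for a fast nroff can be told apart.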

Now, I'm all in favor of benchmarking large real programs, particularly the
ones that *I* like to run.  They also make a very nice sanity check to guard
against silly benchmark deficiencies like do-nothing loops and results that
can be determined at compile time.  But if cost constraints make me pick one
or the other, I'll take the suite of synthetic tests any day.
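
The two deficiencies named above, do-nothing loops and results that can be
determined at compile time, can be sketched in C as follows.  This is an
assumed example, not any real benchmark: sum_squares, time_flawed and
time_checked are invented names, and the volatile-plus-checksum approach is
a common precaution rather than a guarantee against every optimizer.

```c
#include <assert.h>
#include <time.h>

/* Work under test: sum of the squares 1..n. */
unsigned long sum_squares(unsigned long n)
{
    unsigned long s = 0;
    for (unsigned long i = 1; i <= n; i++)
        s += i * i;
    return s;
}

/* Flawed: the argument is a compile-time constant and the result is
 * thrown away, so an optimizer may precompute the answer or delete
 * the loop, leaving nothing to time. */
double time_flawed(int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        (void)sum_squares(1000);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Sturdier: n is read through a volatile object, so its value is not
 * known at compile time, and every result feeds a checksum that the
 * caller is expected to use. */
double time_checked(unsigned long n, int reps, unsigned long *checksum)
{
    assert(reps > 0);
    volatile unsigned long runtime_n = n;
    unsigned long sum = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sum += sum_squares(runtime_n);
    clock_t end = clock();
    *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A large real program is unlikely to collapse this way, since its output
has to be correct, which is why it works as a sanity check on results from
small tests like these.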

--Rik