Path: utzoo!attcan!uunet!seismo!sundc!pitstop!sun!amdcad!ames!mailrus!cornell!uw-beaver!uw-june!rik
From: rik@june.cs.washington.edu (Rik Littlefield)
Newsgroups: comp.arch
Subject: Re: Benchmarking
Summary: There are problems with "large real" programs, too.
Message-ID: <6001@june.cs.washington.edu>
Date: 9 Oct 88 16:48:27 GMT
References: <2220003@hpausla.HP.COM> <46500026@uxe.cso.uiuc.edu> <1988Oct9.011633.13259@utzoo.uucp>
Organization: U of Washington, Computer Science, Seattle
Lines: 30

Many postings in this thread seem to assume that "large, real" programs are somehow the fairest ones to use for benchmarking. That's not necessarily true. Any program that has had all or most of its development on a single system has undoubtedly been tuned for best performance ON THAT SYSTEM. Look at the series of postings on "Duff's device" (an unrolled loop): systems without instruction caches (or with large ones :-) tend to produce programs that use Duff's device, while those with small caches encourage tight loops instead. If somebody's compiler doesn't do induction-variable optimization on array index expressions, people tend to write critical loops using pointers. Etc., etc. I'd guess that an awful lot of Unix programs have been tuned to whatever it is that pcc does or doesn't do. The point is, large real programs tend to have long histories that bias them in favor of old compiler technology and old architectures.

Another problem with large real programs is that it's often very difficult to tell what the benchmark results mean. Does nroff run fast on system Q because Q does stream I/O especially well, or because Q is really good at optimizing some 10-line inner loop that shoves characters around? If I can't read the code or tell where it's spending its time, how can I possibly relate a benchmark result to some different program or application? Personally, I get a lot more insight out of a few hundred lines of good test cases that I can understand in detail.
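To make the tuning choices mentioned above concrete, here is a minimal C sketch of both pairs: a tight copy loop versus Duff's device, and an indexed loop versus a hand-written pointer loop. The function names (copy_loop, copy_duff, sum_indexed, sum_pointer) are hypothetical, chosen only for illustration. copy_duff is the common memory-to-memory variant of Duff's device; Tom Duff's original copied into a single output register rather than advancing the destination pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Tight loop: one byte per iteration. Small code, which suits a
   machine with a small instruction cache. */
void copy_loop(char *to, const char *from, size_t count)
{
    while (count-- > 0)
        *to++ = *from++;
}

/* Memory-to-memory variant of Duff's device: the copy is unrolled 8x,
   and the switch jumps into the middle of the do-while so the
   count % 8 leftover bytes are copied on the first pass. As in the
   original, count must be greater than 0, because the do-while body
   always runs at least once (the assert only checks this when
   NDEBUG is not defined). */
void copy_duff(char *to, const char *from, size_t count)
{
    assert(count > 0);
    size_t n = (count + 7) / 8;   /* number of passes through the loop */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}

/* Indexed form: relies on the compiler's induction-variable
   optimization (strength reduction) to turn a[i] into a pointer bump. */
long sum_indexed(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer form: the same strength reduction done by hand, as people
   tend to write it when their compiler won't do it for them. */
long sum_pointer(const int *a, size_t n)
{
    long s = 0;
    for (const int *p = a, *end = a + n; p < end; p++)
        s += *p;
    return s;
}
```

Each pair computes the same result; which member runs faster depends on the cache size and the compiler, not on the algorithm. A program whose critical loops were written one way on one system therefore carries that system's preferences along when it is used as a benchmark elsewhere.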
Now, I'm all in favor of benchmarking large real programs, particularly the ones that *I* like to run. They also make a very nice sanity check to guard against silly benchmark deficiencies such as do-nothing loops and results that can be determined at compile time. But if cost constraints make me pick one or the other, I'll take the suite of synthetic tests any day.

--Rik