Path: utzoo!yunexus!geac!syntron!jtsv16!uunet!husc6!bloom-beacon!think!ames!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Benchmarking
Message-ID: <5356@winchester.mips.COM>
Date: 11 Oct 88 17:15:27 GMT
Article-I.D.: winchest.5356
References: <2220003@hpausla.HP.COM> <46500026@uxe.cso.uiuc.edu> <1988Oct9.011633.13259@utzoo.uucp> <6001@june.cs.washington.edu>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 71

In article <6001@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>Many postings in this stream seem to assume that "large, real" programs are
>somehow the most fair to use for benchmarking. That's not necessarily true.

As we've said numerous times, the best benchmark for anybody is to run their own real applications, because such applications obviously have the highest correlation with what they'll see in real use. When I keep saying "use large, real programs", it's because I usually have in front of me numerous statistics about the behavior of programs, and they show that most of the toy benchmarks aren't very good predictors of the real applications, especially when applied to the higher-performance designs. Why is this?

a) Toys don't stress cache designs, so that small caches and large ones act about the same, which is simply untrue for many real programs. (In this case, "cache" includes any place in the memory hierarchy, including registers, stack caches, register windows, 1-to-n levels of memory caches, disk caches in main memory, etc.)

b) Toys don't stress limits. For example, consider the performance differences attributable to the different X86 memory models.

c) Toys don't stress software. Anybody can compile Dhrystone or Whetstone, and many can optimize them. Compiling/optimizing Spice tells you a lot more.
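
Point (a) can be illustrated with a small simulation. This is only a sketch under stated assumptions: a direct-mapped cache with 16-byte lines, two made-up cache sizes, and synthetic address traces that sweep repeatedly over an array. The names hit_rate and sweep are hypothetical, and nothing here is measured data.

```python
def hit_rate(cache_lines, addresses, line_size=16):
    """Simulate a direct-mapped cache; return the fraction of accesses that hit."""
    tags = [None] * cache_lines          # one resident block number per cache line
    hits = 0
    for addr in addresses:
        block = addr // line_size        # memory block holding this address
        index = block % cache_lines      # the single line this block can occupy
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block          # miss: replace whatever was there
    return hits / len(addresses)


def sweep(working_set_bytes, passes=10, stride=4):
    """Synthetic trace: repeated sequential passes over an array of 4-byte words."""
    one_pass = range(0, working_set_bytes, stride)
    return [addr for _ in range(passes) for addr in one_pass]


# Assumed sizes, chosen only for illustration.
toy = sweep(1024)            # 1 KB working set, a "toy" benchmark
big = sweep(64 * 1024)       # 64 KB working set, a larger program
small, large = 256, 8192     # 4 KB and 128 KB caches (16-byte lines)

for name, trace in (("toy", toy), ("big", big)):
    print(f"{name}: small cache {hit_rate(small, trace):.3f}, "
          f"large cache {hit_rate(large, trace):.3f}")
# toy: small cache 0.975, large cache 0.975
# big: small cache 0.750, large cache 0.975
```

Under these assumptions the 1 KB trace gets the same hit rate from both caches, while the 64 KB trace thrashes the small cache and fits in the large one. That is the sense in which a toy cannot tell the two designs apart.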

>Any program that has had all or most of its development on a single system
>has undoubtedly been tuned for best performance ON THAT SYSTEM. Look at the
>series of postings on "Duff's device" (an unrolled loop) -- systems without
>instruction caches (or with large ones :-) tend to produce programs that use
>Duff's device, those with small caches encourage using tight loops instead.
>If somebody's compiler doesn't do induction on array index expressions, they
>tend to write critical loops using pointers. Etc, etc. I'd guess that an
>awful lot of Unix programs have been tuned to whatever it is that pcc does
>or doesn't do. The point is, large real programs tend to have long
>histories that bias them in favor of old compiler technology and
>architectures.

Most application software doesn't worry about this kind of thing very much: the 3rd-party folks worry most about making things work across lots of machines.

>
>Another problem with large real programs is that it's often very difficult
>to tell what the benchmark results mean. Does nroff run fast on system Q
>because Q does stream I/O especially well, or because Q is really good at
>optimizing some 10-line inner loop that shoves around characters? If I
>can't read the code or tell where it's spending its time, how can I possibly
>relate a benchmark result to some different program or application?
>Personally, I get a lot more insight out of a few hundred lines of good test
>cases that I can understand in detail.

This is certainly true, although good measurement tools help you figure out where the time is going. Of course, if you have small benchmarks that give you good correlation with what you actually use, then you're OK, and there's nothing wrong with using them, i.e., by definition, you're using something correlated with your real applications.

One of the points we've tried to make is that one must be very careful when using simple benchmarks to predict performance across wider ranges of architecture and software. For example, simple benchmarks used to analyze PC-class machines don't necessarily work very well for larger ones. (For PC-class machines, you can probably get a first-order prediction by knowing clock rate, CPU type, and memory latency.)

>
>Now, I'm all in favor of benchmarking large real programs, particularly the
>ones that *I* like to run. They also make a very nice sanity check to guard
>against silly benchmark deficiencies like do-nothing loops and results that
>can be determined at compile time. But if cost constraints make me pick one
>or the other, I'll take the suite of synthetic tests any day.

It is, of course, a goal for many people to create small synthetic benchmarks that accurately predict the behavior of large real applications, and this is a very desirable goal. It's merely hard!
-- 
-john mashey DISCLAIMER:
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086