Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!ucsd!sdd.hp.com!zaphod.mps.ohio-state.edu!mips!winchester!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.benchmarks
Subject: Re: SPEC vs. Dhrystone
Message-ID: <44465@mips.mips.COM>
Date: 3 Jan 91 06:18:59 GMT
References: <44342@mips.mips.COM> <15379@ogicse.ogi.edu> <44353@mips.mips.COM> <1685@marlin.NOSC.MIL> <15546@ogicse.ogi.edu>
Sender: news@mips.COM
Reply-To: mash@mips.COM (John Mashey)
Distribution: comp.benchmarks
Organization: MIPS Computer Systems, Inc.
Lines: 140

In article <15546@ogicse.ogi.edu> borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
>In article <1685@marlin.NOSC.MIL> aburto@marlin.nosc.mil.UUCP (Alfred A. Aburto) writes:
>>In article <44353@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>>(Note, for example, that published Dhrystone results easily mis-predict
>>>SPEC integer benchmarks pretty badly, i.e., it is quite easy for machine
>>>"a" to be 25% faster on Dhrystone than "b", and end up 25% SLOWER on more
>>>realistic integer benchmarks.)
>>This is an interesting observation (result).
>>Dhrystone was intended to be REPRESENTATIVE of TYPICAL integer
>>programs. That is, hundreds (I believe) of programs were
>>analyzed to come up with the (ahem) 'typical' high level
>>language instructions and their frequency of usage. In view of this
>>I would, at first sight, suspect the Dhrystone to be more accurate
>>than SPEC as SPEC is based upon only a few integer programs.
>I suspect there are two factors at work here. First, Dhrystone is a
>fairly small benchmark, and would not exercise the memory hierarchy
>as hard as the real programs in SPEC. The second factor is that it is
>easier to tune a compiler to a small benchmark like Dhrystone (or for
>that matter Whetstone and the Livermore Loops) than it is to tune for
>a variety of different real programs. By the way, I believe Dhrystone
>was originally written in ADA and was translated to "C", the form in
>which it is usually run.
>>Real programs also show a great variation in performance. I noticed

Well, there are multiple issues, of which several have been mentioned
already here. Let us review the history of Dhrystone, which was
originally, as stated, a reasonable attempt to model ADA usage, and
then got converted into C.

ISSUE 1: building a small synthetic benchmark that is TYPICAL of
intended usage is
	1a: extremely difficult to do, even for a current set of
	    hardware and software
	1b: REALLY hard to do and expect to remain valid over time,
	    even in the absence of 1c
	1c: really hard to do, if small enough to be subject to
	    compiler gimmickry

Let us take each of these in turn:

1a: is hard, because the usual (and reasonable) methodology is:
	1: select attributes that should be measured
	2: gather statistics from a set of programs
	3: build the benchmark to model those attributes
The problem is: you may not choose the "right" attributes, and in
fact, there is no small set of right attributes; there are only better
and better approximations, even to modeling a single user program (not
even a large mix). For example, suppose your first approximation is:
count the number of +, -, *, and / executed.
	Not a very good approximation, so add the number of if
	statements.
	Still not good, so add the distribution of sizes of
	expressions.
	Still not good, so add the number of function calls.
	Still not good, so add the distribution of the number of
	arguments of function calls (makes a bigger difference amongst
	machines that pass some arguments in registers).
	Still not good: some architectures (like SPARC) can be
	sensitive to the depth of function calls, so do something
	about that.
	Still not good: haven't done anything about array indexing,
	and different architectures react differently.
	Still not good: haven't done anything about pointers, so add
	some pointer references.
	Still not good: you haven't measured the frequency of
	different-sized offsets from pointers (and surprise!
	in some architectures there is no difference between a zero
	offset and a non-zero offset; in others (such as the AMD 29K),
	a zero offset is cheaper than a non-zero one; in others, the
	presence of particular addressing modes helps some
	combinations much more than others).
	Still not good: how often is the same pointer->object
	referenced close enough in the code, and under such
	conditions, that the compiler can just leave it in a register?
	Still not good: is the distribution of variable references
	such that the benchmark will model the effects of a good
	register allocator, or not?
	..... and on, and on ....
I.e., it is VERY easy to do a competent job of feature extraction and
modeling, and still get surprised, where "surprise" = the synthetic
benchmark doesn't correlate well with realistic code of the class that
it was supposed to model. (I've looked at many synthetic benchmarks
with our tools; the numbers quite often don't look anything like what
you see when you analyze real programs.)

1b: hard to do over time:
If asked to compare machines that basically differ only by the clock
rate (same CPU, same compilers), a small benchmark is adequate.
However, hardware tends to get more complex over time; in particular,
faster machines use caches, caches get bigger, multi-level caches
appear, etc. Programs expand to use these; if a benchmark doesn't also
expand appropriately, it starts to measure only the smallest part of
the memory hierarchy. In addition, optimizing compilers get better,
and they optimize away pieces of the code, especially in a small
synthetic benchmark.

1c: compiler gimmickry:
For any important benchmark that is small, compilers will get tuned in
ways that are absolutely useless in real life. This has happened at
least with Whetstone, Dhrystone, and LINPACK.

ISSUE 2: Dhrystone in particular

The MIPS Performance Brief, Issue 3.9 (and earlier) has had analyses
of Dhrystone issues, for years. Here is a brief summary:

1. Small, will fit in tiny instruction and data caches.

2. References and re-references data in ways that model the effects of
write-back & write-thru caches poorly.

3. Subroutine calls are of shallow depth, hence it never
underflows/overflows on a register window/stack cache machine.

4. Makes function calls more frequently than any real program I've
ever seen, i.e., on a MIPS, uses <40 cycles per call, whereas 60-100
is much more typical of C programs.

5. Can easily spend 30% of its time in strcpy, unlike any real program
I've ever analyzed. Due to the particular use (copy a 30-byte
constant, over and over again), it is especially amenable to
gimmickry, such as compiler options which generate incorrect code for
real use, but happen to work for Dhrystone. (Note that some of this is
an artifact of translation from ADA/Pascal (fixed-length strings) to
C.) Most amusing code: i860, where the 30-byte constant is expanded to
32, and is then copied with 2 16-byte loads, followed by 2 16-byte
stores; not very typical of real C-language string processing, where
most pointers are to variables whose sizes are unknown at compile
time....

6. There is an unusually high frequency of zero-offset pointers.

7. In the earlier versions, there was obvious dead code, which started
to disappear under the pressure of better optimizers (not gimmickry,
just better compilers).

8. Also, the earlier versions never worried about compilers that can
merge the whole program together and inline EVERYTHING...

So, Reinhold W. started with something that was actually a reasonable
attempt, but it is HARD TO DO, and even HARDER to keep sensible...
-- 
-john mashey	DISCLAIMER:
UUCP: 	mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086