Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ut-sally!husc6!linus!philabs!prls!mips!earl
From: earl@mips.UUCP (Earl Killian)
Newsgroups: comp.arch,comp.sys.nsc.32k
Subject: Re: "Unoptimizing" Dhrystone
Message-ID: <312@gumby.UUCP>
Date: Fri, 17-Apr-87 11:32:42 EST
Article-I.D.: gumby.312
Posted: Fri Apr 17 11:32:42 1987
Date-Received: Tue, 21-Apr-87 04:33:45 EST
References: <4190@nsc.nsc.com> <951@moscom.UUCP> <2577@intelca.UUCP> <999@mips.UUCP>
Lines: 90
Summary: look at whetstone
Xref: mnetor comp.arch:1022 comp.sys.nsc.32k:94

>In article <219@homxb.UUCP> gemini@homxb.UUCP (Rick Richardson) writes:
>Meanwhile, any advice on modifying the Dhrystone for version 1.2 such
>that a global optimizer won't be able to remove anything will be
>appreciated.

I'd like to second Steve Correll's comments and go a bit farther.  I
hate to praise whetstone, but if you're looking for ways to improve
dhrystone, then you ought to start by looking at whetstone.  The only
thing dhrystone appears to have borrowed from whetstone so far is the
name, and this is unfortunate.

whetstone
(1) does output so that the program isn't simply thrown away;
(2) does input so that the entire computation cannot be done at
    compile time;
(3) prints results so that the correctness of the run can be verified;
(4) uses a technique for subtracting out loop overhead that does not
    involve timing an empty loop that is likely to be deleted by a
    compiler;
(5) makes every computation's value used later so compilers don't just
    delete the computation (often in conjunction with #3 above);
(6) makes all computations in loops depend on variables that change on
    every loop iteration, thus preventing hoisting.

The only optimization that I know of that begins to trash whetstone is
inlining subroutines (you can also apply angle addition formulas to
save a call or two to SIN/COS if you really want to tailor your
compiler to the benchmark).  Dhrystone does none of these things.
In this respect dhrystone represents a step backward in the art of
benchmarking.  If you want to improve it, bring it up to the standards
of the twenty-year-old whetstone.

I started off saying I hated to praise whetstone.  The reason is that I
don't think it is particularly representative.  Neither is dhrystone.
Both claim to be synthetic benchmarks created from looking at
statistics of source language constructs.  This sounds reasonable on
the surface, but for some reason this approach seems to generate
low-level statistics very different from those of real programs.  For
example, dhrystone averages 15 instructions per procedure call on a
VAX.  The call-intensive programs I've studied average 50 instructions
per call (30 is the lowest real program number I saw).  As an
illustration, I know of a Pascal compiler that produces 8200 dhrystones
on a VAX/8600 when you use CALLS for the procedure calls and 14000 when
you use a faster procedure call protocol.  This is 70% more dhrystones!
And yet that compiler itself only speeds up by 20% when you use the
better procedure call protocol.

Another example: this forum has already noted the importance of strcpy
to the C version of dhrystone, and questioned whether it is quite as
important in real programs.

Another example: whetstone is half a test of your floating point math
library and not the compiler or hardware.  Some real floating point
programs (e.g. spice) do spend a lot of time in SQRT, LOG, EXP, etc.
but not as much as whetstone.  And other real programs spend almost no
time there.  On the other hand, perhaps it is a feature that these
benchmarks make people produce good libraries. :-)

Another problem with these benchmarks is that they're small.  They fit
trivially into caches, for example.  For real programs, caches serve
not to make your program fit, but to address the impedance mismatch
between main memory and the cpu.
You still need memory bandwidth proportional to your processing power
(although you wouldn't know it from these small benchmarks) but raw
memory bandwidth isn't well matched to the appetite of a processor.
Thus caches.

Writing a big benchmark is not easy, of course, nor is porting it.  On
the other hand, ask yourself why we're trying to build faster and
faster computers.  The reason is so that we can run bigger and bigger
programs.  Small benchmarks are becoming more and more misleading.

And one benchmark won't characterize all your applications either.  At
MIPS, we use the following programs for benchmarking and architectural
study:

        ccom -- C front-end
        uopt -- global optimizer
        as1 -- assembler second pass
        nroff -- text formatter
        TeX -- another formatter with very different statistics
        compress -- data compression
        espresso -- PLA reduction
        hspice/spice2g6/spice3a7 -- circuit simulation
        timber wolf -- routing
        doduc -- Monte Carlo nuclear reactor simulation
        linpack -- matrix reduction

(There are more that we want to study, but don't have yet.)

All of these have very different statistics.  None fit well in 8KB
caches.  Some fit in 64KB caches; others still do not.  Some use
floating point heavily; some don't.  Some want large cache blocks;
others want small.  Some need very large virtual-to-physical
translation caches; others need only a few entries.  Linpack can use
vector ops; the others probably cannot.  Etc. etc.