Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ut-sally!husc6!linus!philabs!prls!mips!earl
From: earl@mips.UUCP (Earl Killian)
Newsgroups: comp.arch,comp.sys.nsc.32k
Subject: Re: "Unoptimizing" Dhrystone
Message-ID: <312@gumby.UUCP>
Date: Fri, 17-Apr-87 11:32:42 EST
Article-I.D.: gumby.312
Posted: Fri Apr 17 11:32:42 1987
Date-Received: Tue, 21-Apr-87 04:33:45 EST
References: <4190@nsc.nsc.com> <951@moscom.UUCP> <2577@intelca.UUCP> <999@mips.UUCP>
Lines: 90
Summary: look at whetstone
Xref: mnetor comp.arch:1022 comp.sys.nsc.32k:94

>In article <219@homxb.UUCP> gemini@homxb.UUCP (Rick Richardson) writes:
>Meanwhile, any advice on modifying the Dhrystone for version 1.2 such
>that a global optimizer won't be able to remove anything will be
>appreciated.

I'd like to second Steve Correll's comments and go a bit farther.

I hate to praise whetstone, but if you're looking for ways to improve
dhrystone, then you ought to start by looking at whetstone.  The only
thing dhrystone appears to have borrowed from whetstone so far is the
name, and this is unfortunate.  whetstone (1) does output so that the
program isn't simply thrown away; (2) does input so that the entire
computation cannot be done at compile time; (3) prints results so that
the correctness of the run can be verified; (4) uses a technique for
subtracting out loop overhead that does not involve timing an empty
loop that is likely to be deleted by a compiler; (5) makes every
computation's value used later so compilers don't just delete the
computation (often in conjunction with #3 above); (6) makes all
computations in loops depend on variables that change on every loop
iteration, thus preventing hoisting.  The only optimization that I
know of that begins to trash whetstone is inlining subroutines (you
can also apply angle addition formulas to save a call or two to
SIN/COS if you really want to tailor your compiler to the benchmark).
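
To make points (4) through (6) concrete, here is a minimal sketch in C.
It is not taken from whetstone or dhrystone; step(), kernel_x1(),
kernel_x2() and body_seconds() are hypothetical names made up for this
illustration.  The idea is to time the same loop with the body executed
once and then twice per iteration, and to take the difference, so that
loop overhead cancels without ever timing an empty loop.  A sufficiently
aggressive compiler may still find ways around it.

```c
#include <time.h>

/* One unit of work.  The result depends on the previous value and on
 * the loop index, so it changes on every iteration (nothing to hoist). */
static unsigned long step(unsigned long acc, unsigned long i)
{
    return (acc * 31UL + i) ^ (acc >> 7);
}

/* Loop with the body executed once per iteration. */
unsigned long kernel_x1(unsigned long seed, unsigned long n)
{
    unsigned long acc = seed, i;
    for (i = 0; i < n; i++)
        acc = step(acc, i);
    return acc;
}

/* Same loop, with the body executed twice per iteration. */
unsigned long kernel_x2(unsigned long seed, unsigned long n)
{
    unsigned long acc = seed, i;
    for (i = 0; i < n; i++) {
        acc = step(acc, i);
        acc = step(acc, i);
    }
    return acc;
}

/* Estimated seconds for n executions of step(), with loop overhead
 * cancelled: time(x2) - time(x1).  *check receives a value derived
 * from both results; the caller is expected to print it so that
 * neither loop is dead code. */
double body_seconds(unsigned long seed, unsigned long n, unsigned long *check)
{
    clock_t t0, t1, t2;
    unsigned long r1, r2;

    t0 = clock();
    r1 = kernel_x1(seed, n);
    t1 = clock();
    r2 = kernel_x2(seed, n);
    t2 = clock();

    *check = r1 ^ r2;
    return ((double)(t2 - t1) - (double)(t1 - t0)) / CLOCKS_PER_SEC;
}
```

A driver would read seed and n from its input (point 2), call
body_seconds(), and print both the time and the check value (points 1
and 3).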

Dhrystone does none of these things.  It represents a step backward in
the art of benchmarking.  If you want to improve dhrystone, bring it
up to the standards of the twenty year old whetstone.

I started off saying I hated to praise whetstone.  The reason is that
I don't think it is particularly representative.  Neither is
dhrystone.  Both claim to be synthetic benchmarks created from looking
at statistics of source language constructs.  This sounds reasonable
on the surface, but for some reason this approach seems to generate
very different low-level statistics than real programs.

For example, dhrystone averages 15 instructions per procedure call on
a VAX.  The call-intensive programs I've studied average 50
instructions per call (30 is the lowest real program number I saw).
For example, I know of a pascal compiler that produces 8200 dhrystones
on a VAX/8600 when you use CALLS for the procedure calls and 14000
when you use a faster procedure call protocol.  This is 70% more
dhrystones!  And yet that compiler itself only speeds up by 20% when
you use the better procedure call protocol.
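
The arithmetic behind that comparison can be sketched as follows.  The
helper name time_saved_fraction() is made up for this illustration, and
I am assuming that "speeds up by 20%" means the compiler runs 1.2 times
as fast.  If a program becomes S times faster, the fraction of its
original run time that went away is 1 - 1/S.

```c
/* Fraction of the original run time that disappears when a program
 * becomes `speedup` times faster: 1 - T_new/T_old = 1 - 1/speedup.
 * speedup must be nonzero. */
double time_saved_fraction(double speedup)
{
    return 1.0 - 1.0 / speedup;
}
```

time_saved_fraction(14000.0 / 8200.0) is about 0.41, while
time_saved_fraction(1.2) is about 0.17.  Under that reading, the faster
protocol removed roughly 41% of dhrystone's run time but only about 17%
of the compiler's.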

Another example: this forum has already noted the importance of strcpy
to the C version of dhrystone, and questioned whether it is quite as
important in real programs.

Another example: whetstone is half a test of your floating point math
library and not the compiler or hardware.  Some real floating point
programs (e.g. spice) do spend a lot of time in SQRT, LOG, EXP, etc.
but not as much as whetstone.  And other real programs spend almost no
time there.

On the other hand, perhaps it is a feature that these benchmarks do
make people produce good libraries.  :-)

Another problem with these benchmarks is that they're small.  They fit
trivially into caches, for example.  For real programs, caches serve
not to make your program fit, but to address the impedance
mismatch between main memory and the cpu.  You still need memory
bandwidth proportional to your processing power (although you wouldn't
know it from these small benchmarks) but raw memory bandwidth isn't
well matched to the appetite of a processor.  Thus caches.
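
One rough way to see the effect of working-set size is a probe like the
one below.  This is only a sketch; walk() and its parameters are
hypothetical names, not part of any of the benchmarks discussed here.
Timing it for buffer sizes ranging from a few kilobytes up to several
megabytes, with the number of passes scaled so that the total number of
bytes touched stays constant, would be expected to show the time per
access rising once the buffer no longer fits in the cache.

```c
#include <stddef.h>

/* Touch every `stride`-th byte of buf, `passes` times.  Each pass
 * modifies the bytes it reads, so successive passes cannot simply be
 * folded into one.  stride must be nonzero.  Returns a checksum that
 * the caller should print so the work is not dead code. */
unsigned long walk(unsigned char *buf, size_t size, size_t stride,
                   unsigned long passes)
{
    unsigned long sum = 0, p;
    size_t i;

    for (p = 0; p < passes; p++) {
        for (i = 0; i < size; i += stride) {
            buf[i] = (unsigned char)(buf[i] + 1);
            sum += buf[i];
        }
    }
    return sum;
}
```

A benchmark the size of dhrystone corresponds to the small-buffer end
of such a sweep, which is the part that says the least about the memory
system.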

Writing a big benchmark is not easy, of course, nor is porting it.
On the other hand, ask yourself why we're trying to build faster and
faster computers.  The reason is so that we can run bigger and bigger
programs.  Small benchmarks are becoming more and more misleading.
And one benchmark won't characterize all your applications either.
At MIPS, we use the following programs for benchmarking and
architectural study:
	ccom		-- C front-end
	uopt		-- global optimizer
	as1		-- assembler second pass
	nroff		-- text formatter
	TeX		-- another formatter with very different statistics
	compress	-- data compression
	espresso	-- PLA reduction
	hspice/spice2g6/spice3a7	-- circuit simulation
	timber wolf	-- routing
	doduc		-- monte carlo nuclear reactor simulation
	linpack		-- matrix reduction
(There are more that we want to study, but don't have yet.)  All of
these have very different statistics.  None fit well in 8Kb caches.
Some fit in 64Kb caches; others still do not.  Some use floating point
heavily; some don't.  Some want large cache blocks; others want small.
Some need very large virtual to physical translation caches; others
need only a few entries.  Linpack can use vector ops; the others
probably cannot.  Etc. etc.