Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sun-barr!newstop!male!texsun!convex!convex.COM
From: patrick@convex.COM (Patrick F. McGehearty)
Newsgroups: comp.benchmarks
Subject: Re: Don't use bc (was: More issues of benchmarking)
Message-ID: <109872@convex.convex.com>
Date: 5 Dec 90 16:02:14 GMT
References: <1990Dec3.191756.15280@cs.utk.edu> <39871@ucbvax.BERKELEY.EDU> <1990Dec3.204027.16794@cs.utk.edu>
Sender: usenet@convex.com
Reply-To: patrick@convex.COM (Patrick F. McGehearty)
Organization: Convex Computer Corporation, Richardson, Tx.
Lines: 125

In article <1990Dec3.204027.16794@cs.utk.edu> Dave Sill writes:
>In article <39871@ucbvax.BERKELEY.EDU>, jbuck@galileo.berkeley.edu (Joe Buck) writes:
>>In article <1990Dec3.191756.15280@cs.utk.edu>, de5@ornl.gov (Dave Sill) writes:
>>> It [the bc bmark] may not be rigorous, but it does have value.
>>> For one thing, it's short enough to be memorized and easily typed
>>> at a box at, say, an expo.
>>
>>The bc "benchmark" is easy to measure, but the number is worthless.
>
>Why do you say that?  It's using a known tool to perform a known task,
>right?  No, it won't tell whether machine X is faster than machine Y
>with absolute certainty.  But it will tell me *something*.
>
>I agree that both are valid tests.  I don't agree that the bc test is
>fundamentally different.  It's a smaller version of the same idea.
>
>>The idea is to see how fast the computer runs
>>programs that people actually use heavily.  If a vendor's machine
>>manages to do all these things fast, it will probably be fast on your
>>real workload.
>
>Exactly, but you can't carry a SPEC tape with you wherever you go.  I
>*can* carry the bc test with me and very quickly determine which end
>of the spectrum an unknown machine falls in.  Will I base purchase
>decisions on such an admittedly trivial test?  Of course not.

I suggest that the bc benchmark is worse than worthless, for several
reasons.

First, as has been pointed out, it does not measure the raw
add/multiply rate of the machine.  It measures the "multi-precision
arithmetic" capabilities as implemented by dc, which is mostly
subroutine call/returns.  Further, I have never seen a system where
bc/dc is a significant user of cycles.  Thus, the less-than-expert
user will believe the measurements represent something different
from reality.
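For anyone who has not actually seen it, the test under discussion is
just a bc one-liner timed from the shell.  I do not have the exact
expression from the original posting in front of me, so take the
following as a representative sketch rather than the canonical form:

	# A representative form of the "bc benchmark".  The expression
	# 2^5000 is a stand-in; the exact one varies from posting to
	# posting.  /bin/time reports how long bc takes to evaluate
	# one large multi-precision integer exponentiation.
	echo '2^5000' | /bin/time bc > /dev/null

Everything this exercises is the multi-precision integer package in
dc, driven through bc; the machine's floating point hardware is
essentially never touched.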
Second, for most machines, little architecture or compiler work has
been done (or should be done) to optimize this application.  So the
test will not tell you which machines have features useful to your
application and which do not.

Third, widespread reporting of such a benchmark will encourage other,
less knowledgeable buyers to read more into the numbers than should
be read.

Fourth, if buyers use the benchmark, then vendors will be encouraged
to put resources into enhancing their performance on it instead of
enhancing something useful.  This is a bad thing, and it is the
primary reason I am posting: bad benchmarks lead to lots of wasted
effort.

I use the Whetstone benchmark as a "proof by example".  I know of
several vendor development efforts (going back much more than 10
years; this is not a new phenomenon) which went to extreme lengths to
improve their Whetstone results, including adding special microcode
for certain instructions which the compiler generated only for the
Whetstone benchmark.  Obviously, that particular trick only makes
sense for the old-style CISCy architectures, but you get the idea of
what vendors will do to improve their benchmark results.

There are similar stories for the Dhrystone benchmark.  In these
cases, the development efforts were not totally wasted: efforts to
speed up the transcendental functions (SIN, COS, etc.) used in the
Whetstones helped those applications which use the transcendentals.
I see no comparable value to most users of general-purpose computing
(scientific or business) in optimizing bc/dc.

Many procurements require some minimum rate on some well-known
benchmark before a vendor is even allowed to bid.  If you can't make
that number, you don't get a chance to show how good your
architecture and compilers are at executing the customer's real
application.  There are even a significant number of customers who do
not run benchmarks at all before purchase; they just rely on quoted
numbers for well-known benchmarks.  It is our duty as responsible
professionals to develop and measure benchmarks that mean something,
and to explain what they mean.

For workstations, the SPECmark benchmarks provide programs which are
sufficiently complex to avoid trivial trick optimizations.  If an
optimization can make that set run faster, it will probably also
apply to real application code.  For scientific computers, the
Perfect Club benchmarks serve the same purpose.  They represent a
dozen scientific applications whose inner loops cover a variety of
code patterns found in real applications.  The Livermore loops are
likewise representative of the inner loops of real applications.
Improving these codes will improve real application performance.  In
a few years their solution times will become so short as to require
new problem definitions or data sets, but meanwhile we in system
development will have some meaningful metrics to work toward
improving.

If you really must have a "quick and dirty" benchmark, how about the
following:

      program main
      real*8 a(256,256),b(256,256),c(256,256)
      call matmul(a,b,c,256)
      end

      subroutine matmul(a,b,c,n)
      real*8 a(n,n),b(n,n),c(n,n)
      do i = 1, n
         do j = 1, n
            c(i,j) = 0.0
            do k = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      return
      end

This is a basic matrix multiply loop, and it takes less than a second
on a Convex C210.  If you are running on a fast machine, you might
want to change 256 to 1024; the triple loop performs n^3
multiply-adds, so quadrupling n gives 4^3 = 64 times more work.  The
matmul routine is kept separate from the main routine so that the
optimizer cannot eliminate the work unless it performs
interprocedural optimization or routine inlining.  Be sure not to
invoke any such options.

This toy benchmark focuses on the floating point performance of the
machine.  It should show the architecture in a relatively favorable
light if floating point is an important part of its product segment.
It is large enough to blow most current cache systems if there is too
great a disparity between cache and non-cache processor performance.
And it is not hard to memorize, or to carry a copy around and type
in.

On a Convex, execute it with:

	fc -O2 test.f -o test; /bin/time -e test

-O2 requests vectorization; -O3 requests combined
parallel/vectorization optimization.  The -e switch is a Convex
extension to /bin/time that provides extended accuracy, down to the
microsecond level; otherwise timing is recorded only to the nearest
100th of a second, for compatibility with previous releases.
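The same measurement works on most any Unix system.  Assuming a
generic f77 installation (compiler name, optimization flag, and timer
all vary from system to system), the invocation would look something
like:

	# Sketch for a generic Unix box; "f77" and "-O" are
	# assumptions, so substitute your local compiler and a
	# moderate optimization level.
	f77 -O test.f -o test
	/bin/time ./test

As noted above, keep inlining and interprocedural options turned off,
or the optimizer may throw the whole computation away.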