Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sun-barr!newstop!male!texsun!convex!convex.COM
From: patrick@convex.COM (Patrick F. McGehearty)
Newsgroups: comp.benchmarks
Subject: Re: Don't use bc (was: More issues of benchmarking)
Message-ID: <109872@convex.convex.com>
Date: 5 Dec 90 16:02:14 GMT
References: <1990Dec3.191756.15280@cs.utk.edu> <39871@ucbvax.BERKELEY.EDU> <1990Dec3.204027.16794@cs.utk.edu>
Sender: usenet@convex.com
Reply-To: patrick@convex.COM (Patrick F. McGehearty)
Organization: Convex Computer Corporation, Richardson, Tx.
Lines: 125

In article <1990Dec3.204027.16794@cs.utk.edu> Dave Sill writes:
>In article <39871@ucbvax.BERKELEY.EDU>, jbuck@galileo.berkeley.edu (Joe Buck) writes:
>>In article <1990Dec3.191756.15280@cs.utk.edu>, de5@ornl.gov (Dave Sill) writes:
>>> It [the bc bmark] may not be rigorous, but it does have value.
>>> For one thing, it's short enough to be memorized and easily typed
>>> at a box at, say, an expo.
>>
>>The bc "benchmark" is easy to measure, but the number is worthless.
>
>Why do you say that?  It's using a known tool to perform a known task,
>right?  No, it won't tell whether machine X is faster than machine Y
>with absolute certainty.  But it will tell me *something*.
>
>I agree that both are valid tests.  I don't agree that the bc test is
>fundamentally different.  It's a smaller version of the same idea.
>
>>The idea is to see how fast the computer runs
>>programs that people actually use heavily.  If a vendor's machine
>>manages to do all these things fast, it will probably be fast on your
>>real workload.
>
>Exactly, but you can't carry a SPEC tape with you wherever you go.  I
>*can* carry the bc test with me and very quickly determine which end
>of the spectrum an unknown machine falls in.  Will I base purchase
>decisions on such an admittedly trivial test?  Of course not.

I suggest that the bc benchmark is worse than worthless, for several
reasons.

First, as has been pointed out, it does not measure the raw
add/multiply rate of the machine.  It measures the "multi-precision
arithmetic" capabilities as implemented by dc, which is mostly
subroutine call/returns.  Further, I have never seen a system where
bc/dc is a significant user of cycles.  Thus, the less-than-expert
user will believe the measurements represent something different
from reality.
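For anyone who has not actually seen it, the test under discussion is
just a bc one-liner timed from the shell.  I do not have the exact
expression from the original posting in front of me, so take the
following as a representative sketch rather than the canonical form:

	# A representative form of the "bc benchmark".  The expression
	# 2^5000 is a stand-in; the exact one varies from posting to
	# posting.  /bin/time reports how long bc takes to evaluate
	# one large multi-precision integer exponentiation.
	echo '2^5000' | /bin/time bc > /dev/null

Everything this exercises is the multi-precision integer package in
dc, driven through bc; the machine's floating point hardware is
essentially never touched.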
Second, for most machines, little architecture or compiler work has
been done (or should be done) to optimize this application.  So the
test will not tell you which machines have features useful to your
application and which do not.

Third, widespread reporting of such a benchmark will encourage other,
less knowledgeable buyers to read more into the numbers than should
be read.

Fourth, if buyers use the benchmark, then vendors will be encouraged
to put resources into enhancing their performance on it instead of
enhancing something useful.  This is a bad thing, and it is the
primary reason I am posting: bad benchmarks lead to lots of wasted
effort.

I use the Whetstone benchmark as a "proof by example".  I know of
several vendor development efforts (going back much more than 10
years; this is not a new phenomenon) which went to extreme lengths to
improve their Whetstone results, including adding special microcode
for certain instructions which the compiler generated only for the
Whetstone benchmark.  Obviously, that particular trick only makes
sense for the old-style CISCy architectures, but you get the idea of
what vendors will do to improve their benchmark results.

There are similar stories for the Dhrystone benchmark.  In these
cases, the development efforts were not totally wasted: efforts to
speed up the transcendental functions (SIN, COS, etc.) used in the
Whetstones helped those applications which use the transcendentals.
I see no comparable value to most users of general-purpose computing
(scientific or business) in optimizing bc/dc.

Many procurements require some minimum rate on some well-known
benchmark before a vendor is even allowed to bid.  If you can't make
that number, you don't get a chance to show how good your
architecture and compilers are at executing the customer's real
application.  There are even a significant number of customers who do
not run benchmarks at all before purchase; they just rely on quoted
numbers for well-known benchmarks.  It is our duty as responsible
professionals to develop and measure benchmarks that mean something,
and to explain what they mean.

For workstations, the SPECmark benchmarks provide programs which are
sufficiently complex to avoid trivial trick optimizations.  If an
optimization can make that set run faster, it will probably also
apply to real application code.  For scientific computers, the
Perfect Club benchmarks serve the same purpose.  They represent a
dozen scientific applications whose inner loops cover a variety of
code patterns found in real applications.  The Livermore loops are
likewise representative of the inner loops of real applications.
Improving these codes will improve real application performance.  In
a few years their solution times will become so short as to require
new problem definitions or data sets, but meanwhile we in system
development will have some meaningful metrics to work toward
improving.

If you really must have a "quick and dirty" benchmark, how about the
following:

      program main
      real*8 a(256,256),b(256,256),c(256,256)
      call matmul(a,b,c,256)
      end

      subroutine matmul(a,b,c,n)
      real*8 a(n,n),b(n,n),c(n,n)
      do i = 1, n
         do j = 1, n
            c(i,j) = 0.0
            do k = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      return
      end

This is a basic matrix multiply loop, and it takes less than a second
on a Convex C210.  If you are running on a fast machine, you might
want to change 256 to 1024; the triple loop performs n^3
multiply-adds, so quadrupling n gives 4^3 = 64 times more work.  The
matmul routine is kept separate from the main routine so that the
optimizer cannot eliminate the work unless it performs
interprocedural optimization or routine inlining.  Be sure not to
invoke any such options.

This toy benchmark focuses on the floating point performance of the
machine.  It should show the architecture in a relatively favorable
light if floating point is an important part of its product segment.
It is large enough to blow most current cache systems if there is too
great a disparity between cache and non-cache processor performance.
And it is not hard to memorize, or to carry a copy around and type
in.

On a Convex, execute it with:

	fc -O2 test.f -o test; /bin/time -e test

-O2 requests vectorization; -O3 requests combined
parallel/vectorization optimization.  The -e switch is a Convex
extension to /bin/time that provides extended accuracy, down to the
microsecond level; otherwise timing is recorded only to the nearest
100th of a second, for compatibility with previous releases.
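The same measurement works on most any Unix system.  Assuming a
generic f77 installation (compiler name, optimization flag, and timer
all vary from system to system), the invocation would look something
like:

	# Sketch for a generic Unix box; "f77" and "-O" are
	# assumptions, so substitute your local compiler and a
	# moderate optimization level.
	f77 -O test.f -o test
	/bin/time ./test

As noted above, keep inlining and interprocedural options turned off,
or the optimizer may throw the whole computation away.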