Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!swrinde!cs.utexas.edu!sun-barr!newstop!texsun!convex!news
From: patrick@convex.COM (Patrick F. McGehearty)
Newsgroups: comp.benchmarks
Subject: Re: Price/Performance figures for Number-Crunching
Message-ID: <1991Mar21.000302.10103@convex.com>
Date: 21 Mar 91 00:03:02 GMT
References: <1991Mar20.104926@IASTATE.EDU>
Sender: news@convex.com (news access account)
Reply-To: patrick@convex.COM (Patrick F. McGehearty)
Organization: Convex Computer Corporation, Richardson, Tx.
Lines: 89
Nntp-Posting-Host: mozart.convex.com

In article mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>> On 20 Mar 91 16:49:26 GMT, carter@IASTATE.EDU (Carter Michael Brannon) said:
> lots of interesting discussion of LINPACK vs $$ vs other things. ...
...until we get to here...which leads to what I want to comment on ...
>
>Carter> Another problem with MFLOPS and LINPACK is "benchmark rot".
>Carter> The LINPACK benchmark has become so important that compilers
>Carter> now have "LINPACK recognizers" that drop in super-optimized
>Carter> code whenever they see a structure that looks like the LINPACK
>Carter> kernel.
>
Mccalpin>Documentation? Or is this another "urban myth"?
>

I would recommend the phrase "benchmark smoothing" over "benchmark rot",
in the sense that the benchmark has been used like sandpaper to smooth
rough edges out of the compiler. The smaller the benchmark, the coarser
and more specific the smoothing. Thus, any benchmark which has been
widely used for a number of years will be much less likely to trip over
compiler weaknesses than newer benchmarks.

Many customers use known benchmarks to select their first round of
bidders, and then ask that their own benchmarks be run for final
selection purposes. This procedure explains why having "LINPACK only
recognizers" is not a winning approach. Yes, you get invited to bid,
but no, you don't win the business.
I'm sure they exist, but are quickly found to be a waste of effort.

On the compiler development side (where I sit), LINPACK certainly has
been a "compiler smoother". Several of our current optimizations were
driven by LINPACK. They are not specific to LINPACK, since they will
work for most dense vector*vector or matrix*vector code, but they
certainly are important to getting the LINPACK results we get.

Is this the sort of "LINPACK recognizer" that Carter was referring to?
If so, I am concerned about the apparent disdain for "pattern
recognizers". Pattern recognition is a critical tool in the optimizing
compiler's tool bag. There are large numbers of such "pattern
recognizers" in any good optimizing compiler, and a good compiler
development group is continually trying to identify new ones that have
at least moderate degrees of utility.

So, I would argue that you should not disdain the old benchmarks for
"benchmark rot", but be wary of any single number to characterize an
architecture. Look at several benchmark results (as many as you can
find), and cross check the different machines on the different tests.
When a large variety of tests show machine A to be twice as fast as
machine B, then you can have more confidence in your result than when a
single benchmark shows such a result. If the results vary, then
understanding why will lead to better understanding of how the machines
would work in your environment.

I agree with McCalpin that the Linpack 1000x1000 data provide a good
'truth in advertising' comparison with the "PEAK/guaranteed not to
exceed" numbers quoted by marketing glossies. Most applications will
not approach those rates without some serious optimization effort, but
at least you know what the potentials are. Data for new systems may be
subject to later improvement due to better understanding of the system
by the benchmarkers, but otherwise, it is a useful guideline. That
problem size should be sufficient for up to about 10 Gflop size systems.

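
As a rough illustration of why a fixed 1000x1000 problem runs out of
room somewhere around that rate, here is a small Python sketch. It
assumes the conventional LINPACK operation count of 2/3 n^3 + 2 n^2 for
solving an n x n system; the function names and the sample rates are
made up for illustration, and the "run time gets too short to measure
well" reading is an interpretation, not something stated above.

```python
def linpack_flops(n):
    """Conventional LINPACK operation count for an n x n dense solve.

    Assumption: the usual 2/3 n^3 + 2 n^2 convention is used.
    """
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2


def run_time_seconds(n, rate_flops_per_sec):
    """Idealized run time if the machine sustained the given rate."""
    return linpack_flops(n) / rate_flops_per_sec


if __name__ == "__main__":
    # Hypothetical sustained rates: 100 Mflop/s, 1 Gflop/s, 10 Gflop/s.
    for rate in (1e8, 1e9, 1e10):
        seconds = run_time_seconds(1000, rate)
        print(f"{rate:.0e} flop/s -> {seconds:.4f} s for n = 1000")
```

At n = 1000 the count is about 6.7e8 operations, so a machine
sustaining 10 Gflop/s would finish in roughly 0.07 seconds, which leaves
little for timer resolution and startup overhead to hide behind.
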
For larger and faster systems, I like the concept put forth by Carter
of "how large of a problem can be solved in fixed time?". A debatable
point is "what is the ideal fixed time?" Consider the following two
extremes:
1) Anything less than 1/25 of a second is 'apparently instant'
   from the human perception point of view.
2) For anything longer than a week, you forget why you ran the
   program. :-)
I know there are exceptions on both ends, but these limits serve as
convenient bounds for the domain of discourse.

The "one minute" suggested by Carter has several good points. As a
model of real work, it is small enough that a researcher will not
switch to some other major activity before getting results. This
allows the researcher to continue an incremental train of thought. It
also is small enough for the benchmarker to run repeatedly with
different configurations or tuning options. For these reasons, it is a
useful length of time.

However, it does not cover the 'overnight' run category. For this
category of test, issues involving very large data sets and management
of same come into play. It is an important area of benchmarking that
is generally neglected due to the high cost of developing the benchmark
and getting vendors to run and tune such a benchmark. Perhaps we
(vendors and customers) could encourage the development of such 'super
benchmarks' by supporting two versions of the same code, the one minute
version and the one night version. Then, most tuning could be done
with the one minute version, and the one night version could be used to
confirm the effectiveness of the system for truly large problems.

I seem to have wandered over several related topics, so I will stop now
and wait for the net's comments.
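
For anyone who wants to experiment with the "how large a problem fits
in a fixed time" idea discussed above, here is a minimal Python sketch.
It is not any existing benchmark: `largest_size_within`, `timed`, and
`toy_kernel` are hypothetical names, the kernel is a stand-in, and the
search assumes run time does not decrease as the problem size n grows
(real timings are noisy, so treat the answer as approximate).

```python
import time


def largest_size_within(budget, measure, limit=1 << 20):
    """Largest n in [0, limit] with measure(n) <= budget.

    Assumes measure(n) is non-decreasing in n. Doubles n until the
    budget is exceeded, then bisects between the last fit and the miss.
    """
    lo, hi = 0, 1
    while hi <= limit and measure(hi) <= budget:
        lo, hi = hi, hi * 2
    hi = min(hi, limit + 1)
    # Invariant: lo fits (or is 0); hi does not fit (or is limit + 1).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if measure(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def timed(kernel):
    """Wrap a kernel so that calling it returns elapsed wall seconds."""
    def measure(n):
        start = time.perf_counter()
        kernel(n)
        return time.perf_counter() - start
    return measure


def toy_kernel(n):
    """Stand-in O(n^2) workload: n dot products of length n."""
    row = [1.0] * n
    vec = [2.0] * n
    return sum(sum(a * b for a, b in zip(row, vec)) for _ in range(n))


if __name__ == "__main__":
    # A short budget keeps the demo quick; the "one minute" and
    # "one night" versions would simply pass 60 or several hours.
    n = largest_size_within(0.05, timed(toy_kernel), limit=4096)
    print("largest n within budget:", n)
```

Passing the budget as a parameter is the point of the design: the same
code serves as the one minute version and the one night version, with
only the time limit (and the size cap) changed.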