Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!emory!gatech!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: Price/Performance figures for Number-Crunching
Message-ID:
Date: 20 Mar 91 21:31:18 GMT
References: <1991Mar20.104926@IASTATE.EDU>
Sender: usenet@ee.udel.edu
Organization: College of Marine Studies, U. Del.
Lines: 104
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: carter@IASTATE.EDU's message of 20 Mar 91 16:49:26 GMT

> On 20 Mar 91 16:49:26 GMT, carter@IASTATE.EDU (Carter Michael Brannon) said:

Carter> We found your LINPACK/peak MFLOPS table quite interesting, and
Carter> have some data to add to it; we applaud the use of "stream
Carter> MFLOPS," which we generally refer to as "Level 1 BLAS MFLOPS"
Carter> around here. Your "MFLOPS Max" is what we would call "Level 3
Carter> BLAS MFLOPS", meaning that operands are re-used and processors
Carter> that are bandwidth-starved can still show high numbers.

I am glad you found it interesting....

Carter> However, we find this kind of performance evaluation to be
Carter> limited and misleading. MFLOPS measures correlate poorly with
Carter> actual performance on complete applications, and the LINPACK
Carter> measure is particularly inaccurate in that it ignores all but
Carter> a particular, not very common, form of linear algebra
Carter> operation.

What you mean is that you find the LINPACK MFLOPS correlates poorly
with *your* complete applications. I find pretty good correlation
with *my* complete applications. As for someone *else's*
applications, the appropriate response is, "Well, it depends...."

But in any event, please notice that I only used the LINPACK 1000x1000
hand-optimized number to calculate a price/performance ratio (Max
MFLOPS/million $). The other price/performance number (Stream
MFLOPS/million $) comes from other sources.
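
For readers who want the arithmetic behind the two ratios spelled out, here is a minimal sketch. The function name, the machine, and every number in it are hypothetical placeholders chosen for illustration; none of them come from the table under discussion.

```python
def mflops_per_million_dollars(mflops, price_dollars):
    """Return MFLOPS delivered per million dollars of purchase price."""
    if price_dollars <= 0:
        raise ValueError("price_dollars must be positive")
    return mflops / (price_dollars / 1_000_000)


# Hypothetical machine: these figures are invented for the example.
machine = {
    "max_mflops": 100.0,         # stand-in for a LINPACK 1000x1000 result
    "stream_mflops": 20.0,       # stand-in for a "Stream MFLOPS" result
    "price_dollars": 2_500_000,  # stand-in for a list price
}

max_ratio = mflops_per_million_dollars(
    machine["max_mflops"], machine["price_dollars"])
stream_ratio = mflops_per_million_dollars(
    machine["stream_mflops"], machine["price_dollars"])

print("Max MFLOPS/million $:   ", max_ratio)
print("Stream MFLOPS/million $:", stream_ratio)
```

The two ratios share a denominator, so any gap between them reflects only the gap between the best-case and streaming MFLOPS figures of the same machine.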
The LINPACK 1000x1000 number is a particularly *good* estimate of the
maximum speed attainable by cache-friendly applications on vector and
mildly parallel machines (ncpus=2,4,8 but not ncpus=256). It is not
intended to be a good estimate of the speed of anyone's application.

The "Stream MFLOPS" is a good estimator (in *my* experience) of the
performance of well-structured vectorizable codes with no specific
optimizations.

Carter> [....] LINPACK cannot assess parallel computers because of
Carter> its ethnocentric uniprocessor FORTRAN rules, and it does not
Carter> scale to the amount of computing power available. Do people
Carter> buy computers to perform MFLOPS or to solve problems?

You are getting carried away by your own propaganda here. The LINPACK
100x100 test case has some specific rules. Following those rules
allows one to make specific statements about the results that could
not be made in the absence of those "ethnocentric uniprocessor FORTRAN
rules". It is certainly true that these rules do not allow massively
parallel machines to show off their best potential. So what?

If you don't like those rules, make up another benchmark. I would
suggest something like solving Laplace's equation on a 512x512x512
finite-difference grid. (That's 128 MW of data, and is probably big
enough for just about any massively parallel computer).

Carter> All measurements are for 64-bit, IEEE floating-point
Carter> arithmetic. (You need to state this in your table. Some
Carter> vendors, such as Convex, are notorious for citing 32-bit
Carter> MFLOPS and hoping you won't notice.)

(1) I thought that my table was clear enough. Dongarra's report only
contains 64-bit results and my "Stream MFLOPS" are clearly based on
64-bit arithmetic.

(2) Are you sure you have your libel in order here? The Convex
machines do not run noticeably faster in 32-bits than 64-bits, so why
would they bother?
CDC is the company that would have had something to gain, but their
results were generally clearly labelled as well....

Carter> Another problem with MFLOPS and LINPACK is "benchmark rot".
Carter> The LINPACK benchmark has become so important that compilers
Carter> now have "LINPACK recognizers" that drop in super-optimized
Carter> code whenever they see a structure that looks like the LINPACK
Carter> kernel.

Documentation? Or is this another "urban myth"?

Carter> We have found that LINPACK overpredicts actual
Carter> application performance by an order of magnitude for some
Carter> computers... FPS, Stardent, Convex, Alliant, and to some
Carter> extent, CRAY, are all guilty. So you might find that
Carter> traditional vector computers aren't holding up that well when
Carter> asked to do something other than DAXPY with unit stride!

Which LINPACK number overpredicts performance by an order of
magnitude? The LINPACK 100x100 test case gives very reasonable
numbers for vectorizable applications. If your applications are
running an order of magnitude slower than the LINPACK 100x100
numbers, then you are doing something seriously wrong -- either in
implementing your code or in choosing what machine to run it on....

On the other hand, the LINPACK 1000x1000 numbers could be an order of
magnitude faster than your application, especially on parallel
machines. That test case was never intended to give you an estimate
of your program performance; it was intended to verify that it is
possible to write an application that runs at a substantial fraction
of the machine's peak speed. Thus machines like the Alliant FX/80
stuck out like a sore thumb, since the best performance was only
about 1/3 of the peak advertised performance (69 MFLOPS vs 188 MFLOPS).
--
John D. McCalpin                        mccalpin@perelandra.cms.udel.edu
Assistant Professor                     mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.      J.MCCALPIN/OMNET