Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!gatech!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: Approximate MFLOPS
Message-ID:
Date: 22 Nov 90 14:09:57 GMT
References: <1131@cnw01.storesys.coles.oz.au> <6760@uceng.UC.EDU>
Sender: usenet@ee.udel.edu
Organization: College of Marine Studies, U. Del.
Lines: 82
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: mfinegan@uceng.UC.EDU's message of 21 Nov 90 16:35:17 GMT

> On 21 Nov 90 16:35:17 GMT, mfinegan@uceng.UC.EDU (michael k finegan) said:

michael> Does anyone have some reasonable way of measuring MFLOPS (Linpack
michael> or otherwise :-)) ? Could they email/post the code, or mix used ?

michael> mfinegan@uceng.uc.edu

I would normally reply by e-mail, but the word "reasonable" caught my
attention and I would like to know if my idea of "reasonable" matches
anyone else's.

First, the LINPACK 100x100 test is a fairly exact and repeatable
measure of MFLOPS. It is only "fairly" exact since the vendor is
allowed to re-write the BLAS-1 (Basic Linear Algebra Subprograms)
however s/he wants in order to improve performance. On the other hand,
the vendor is *not* allowed to modify the rest of the source at all --
not even the comments! Presumably this is intended to mimic the
situation of running large dusty decks with no time or expertise
available for detailed optimization.

On the down side, the test is awfully small. The array has only 10,000
elements (80,000 bytes in 64-bit precision, 40,000 bytes in 32-bit),
which means it can be fully contained in a 128 kB cache for the 64-bit
problem and in a 64 kB cache for the 32-bit problem. For machines with
vector units or pipelined FPUs, the overhead of calling the BLAS
routines can be large compared with the time required to process the
average vector length of 67 elements.

(By the way, it has never been clear to me if producing an "unsafe"
BLAS is legal for this test.
By that I mean a BLAS which assumes only strides of 1 and which
removes all the silly "IF" tests in the original source. If the
compilation system were smart enough to use the "unsafe" BLAS only for
the LINPACK library (probably by inlining) and the "safe" BLAS for
direct user calls, then this should be workable. I have seen
significant performance improvements on vector machines (ETA-10 and
Cray Y-MP) by following this approach.)

There are two other LINPACK tests of interest. The LINPACK 300x300
test was devised to use the level-2 BLAS (Matrix-Vector ops) and to
allow a slightly larger problem. This produced huge speed improvements
on the Cray machines, but was a disaster for the memory-to-memory
Cyber 205/ETA-10 because of the non-unit strides employed. In any
event, this test was the least popular, and has subsequently been
dropped from the report.

The LINPACK 1000x1000 test is often called the "anything goes" test.
The vendor is allowed to solve the system of equations in any way s/he
sees fit, including hand-coding the entire solver in assembly
language. The only requirements are that the original driver code be
used and that the MFLOPS calculation be based on the number of
operations required for the original LU-decomposition code.

Almost all vendors of high-performance machines have been able to
achieve something close to their hardware peak performance on this
test. In the interests of politeness, I will not mention any names
here of those who could *not* do so well -- they are all in the report
(see below).

My experience has been that the LINPACK 1000x1000 test (using the
vendor's best technique) is a good estimate of the *real* peak
performance of a computer. *Very* seldom is it possible to write user
code that performs at a significantly higher MFLOPS rate.
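
To make the MFLOPS bookkeeping for the 1000x1000 test concrete, here
is a minimal Python sketch (it is not part of the LINPACK
distribution). It assumes the operation count conventionally used by
the LINPACK driver for an n x n solve, 2n^3/3 + 2n^2 floating-point
operations, charged regardless of how the vendor actually solved the
system. The function names and the sample timing are hypothetical,
chosen only for illustration.

```python
def linpack_flops(n: int) -> float:
    """Nominal operation count for an n x n LINPACK solve.

    Assumption: the conventional count 2/3*n**3 + 2*n**2 (LU
    factorization plus the triangular solves), charged no matter
    which algorithm was actually used.
    """
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def linpack_mflops(n: int, seconds: float) -> float:
    """Convert a measured elapsed time into a LINPACK-style MFLOPS rate."""
    if seconds <= 0.0:
        raise ValueError("elapsed time must be positive")
    return linpack_flops(n) / seconds / 1.0e6


if __name__ == "__main__":
    # Hypothetical timing, for illustration only: a 1000x1000 solve in 2.5 s.
    print(f"{linpack_mflops(1000, 2.5):.1f} MFLOPS")
```

Because the numerator is fixed by the original algorithm, a hand-coded
solver that performs a different number of actual operations is still
rated against the same count; only the measured time changes.
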
I have also noticed that for code with very simple vector constructs
(simple dyads and triads, with few library functions) the LINPACK
100x100 case gives a surprisingly good estimate of the performance
attainable by "real" codes. By "surprisingly good", I mean that my
real codes almost always run at speeds within a factor of 2 of the
LINPACK 100x100 results. The largest differences are with those
machines whose cache refills are slow (like my Silicon Graphics 4D/25)
which run close to a factor of two slower on large codes than on the
LINPACK 100x100.

Enough rambling. The LINPACK codes and the paper tabulating the
results are available from the netlib server. Send an e-mail message
to netlib@ornl.gov, with the text:

        send index for benchmark

and the server will send an e-mail message back listing the names and
descriptions of the benchmark codes and other related material. To get
the LINPACK 100x100 single-precision code, for example, send a message
with the text:

        send linpacks from benchmark

Have fun....
--
John D. McCalpin                        mccalpin@perelandra.cms.udel.edu
Assistant Professor                     mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.      J.MCCALPIN/OMNET