Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: A question about flat out Snake speed.
Message-ID:
Date: 8 Apr 91 23:48:20 GMT
References: <7834@idunno.Princeton.EDU> <45780003@hpcupt3.cup.hp.com>
Sender: usenet@ee.udel.edu
Organization: College of Marine Studies, U. Del.
Lines: 55
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: daryl@hpcupt3.cup.hp.com's message of 6 Apr 91 20:38:43 GMT

>>>>> On 6 Apr 91 20:38:43 GMT, daryl@hpcupt3.cup.hp.com (Daryl Odnert) said:

Daryl> Here are the key things to consider in formulating an answer to
Daryl> Steve's question:

Thanks for the helpful posting.... I just thought I would point out
one detail that got buried near the bottom of your posting:

Daryl> Thus, peak performance for the DAXPY loop on the 66MHz Snakes
Daryl> box is:
Daryl>     66 million instructions per second * (2 flops / 5 instructions)
Daryl>     = 26.4 MFLOPS

Daryl> This (compiled code) result(s) in performance rating of 66 *
Daryl> (8/ 22) = 24 MFLOPS in the inner loop. Thus at the present
Daryl> time, the compilers are achieving about 90% of the peak
Daryl> performance potential on this particular loop
Daryl> (assuming all data fits in the cache.)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is, of course, a big assumption! It is true that the HP caches
are rather large, but the 256kB data cache will not even hold one
200x200 double-precision matrix (320,000 bytes). The memory bandwidth
of the machine limits the long-vector DAXPY performance to under
22 MFLOPS. How well does the Fortran-produced code perform for long
vectors? (My estimate would be in the range of 18-20 MFLOPS.)

It is also important to remember that the 66 MHz machines are not the
$12000 machines. The $12000 machines are the 50 MHz boxes, for which
the numbers scale down by a factor of 0.76 (50/66), giving long-vector
DAXPY performance of about 14-15 MFLOPS.

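The arithmetic above can be checked with a short Python sketch. It only reuses the figures quoted in this thread (clock rates, flop and instruction counts, cache size, and my 18-20 MFLOPS estimate). The function and variable names are made up for illustration, and it assumes that "256kB" means 256 * 1024 bytes and that one instruction issues per clock cycle, as in Daryl's peak calculation.

```python
# Back-of-envelope check of the DAXPY figures quoted in the thread.
# DAXPY computes y = a*x + y: one multiply and one add (2 flops) per element.

def loop_mflops(clock_mhz, flops, instructions):
    """MFLOPS of a loop, assuming one instruction issues per clock cycle."""
    return clock_mhz * flops / instructions

def matrix_bytes(n, bytes_per_element=8):
    """Storage for an n x n matrix of 8-byte (double-precision) values."""
    return n * n * bytes_per_element

DATA_CACHE_BYTES = 256 * 1024  # assumption: "256kB" = 256 * 1024 bytes

# Peak and compiled-code rates on the 66 MHz machine (in-cache case).
theoretical_peak = loop_mflops(66, 2, 5)   # 2 flops per 5 instructions
compiled_rate = loop_mflops(66, 8, 22)     # 8 flops per 22 instructions
efficiency = compiled_rate / theoretical_peak

# Does one 200x200 double-precision matrix fit in the data cache?
matrix = matrix_bytes(200)
fits_in_cache = matrix <= DATA_CACHE_BYTES

# Scale the 18-20 MFLOPS long-vector estimate from 66 MHz down to 50 MHz.
scale = 50 / 66
low, high = 18 * scale, 20 * scale

print(f"peak {theoretical_peak:.1f}, compiled {compiled_rate:.1f} MFLOPS "
      f"({efficiency:.0%} of peak)")
print(f"200x200 matrix: {matrix} bytes, fits in cache: {fits_in_cache}")
print(f"50 MHz scale {scale:.2f}: {low:.1f} to {high:.1f} MFLOPS")
```

The scaling step assumes that long-vector performance tracks the clock rate, which is the same assumption behind the 0.76 factor; it says nothing about how the memory system of the 50 MHz boxes actually behaves.
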
For comparison, the IBM RS/6000 machines run uncached DAXPY at:

    Model 320     6.25 MFLOPS (measured)    13.3 Theoretical
    Model 320H    7.81 MFLOPS (est)         16.7 Theoretical
    Model 530    10.53 MFLOPS (observed)    16.7 Theoretical
    Model 540    12.64 MFLOPS (est)         20.0 Theoretical
    Model 550    17.26 MFLOPS (est)         27.3 Theoretical

The 320H is slower than the 530 even though they both have the same
clock speed (25 MHz) because the 530 has a wider bus (128-bit vs
64-bit), larger cache line size (128 bytes vs 64 bytes), and a larger
data cache (64kB vs 32kB).

Perhaps part of the reason that the IBM performance is so much less
than the memory-bandwidth-limited performance (labeled "Theoretical"
above) is that stores cannot overlap with reads or computations....
--
John D. McCalpin                        mccalpin@perelandra.cms.udel.edu
Assistant Professor                     mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.      J.MCCALPIN/OMNET