Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: A question about flat out Snake speed.
Message-ID:
Date: 4 Apr 91 14:31:07 GMT
References: <1991Apr2.120117.21406@nas.nasa.gov> <7834@idunno.Princeton.EDU>
Sender: usenet@ee.udel.edu
Followup-To: comp.benchmarks
Organization: College of Marine Studies, U. Del.
Lines: 59
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: mccalpin@perelandra.cms.udel.edu's message of 3 Apr 91 19:27:04 GMT

>>>>> On 3 Apr 91 19:27:04 GMT, I (mccalpin@perelandra.cms.udel.edu) wrote
about the "Peak" vs "Streaming" speeds (MFLOPS) of the new HP Snakes
and IBM RS/6000 computers.

Thanks to jbs@watson.ibm.com for pointing out some errors that I would
like to clear up below:

-----------------------

Me> (2) Except for specially coded, cache-friendly stuff like Matrix Multiply
Me> and Gaussian Elimination of Dense Matrices (i.e. LINPACK), most
Me> calculations will be limited by the memory bandwidth. The formula for
Me> 64-bit vector dyad operations (e.g. DSCAL) is
Me>     Streaming MFLOPS = (MBytes/sec)/(24 bytes/FLOP)

This is almost correct. Just delete the word DSCAL and consider
operations of the form:

    a(i) = b(i) op c(i)

where op is one of +,-,*. This requires two 8-byte reads and one
8-byte write per FP operation. This operation dominates most
scientific codes (in my experience), and is therefore to be preferred
over DAXPY for estimating machine speed. DAXPY requires the same
amount of memory traffic, but squeezes in an extra FP op by using a
loop-invariant scalar:

    y(i) = y(i) + scalar*x(i)

DSCAL requires only one read (plus one write) per op, and so scales
like:

    DSCAL MFLOPS = (MB/s)/(16 bytes/FLOP)

-------------------------

Me> The formula for the IBM RS/6000 is slightly different since the
Me> bottleneck is between cache and the FPU, not between main memory and
Me> cache.
Me> The "Streaming MFLOPS" for these machines are

To clarify: What I was trying to say was that the memory bandwidth to
be used to calculate streaming MFLOPS on these machines is the
bandwidth of 8 bytes/clock from the cache to the registers, NOT the
bandwidth of 16 bytes/clock from the main memory to the cache. This
extra bandwidth from main memory to cache is not wasted (since it
decreases the cache refill time considerably), but the machine is not
capable of 128-bit/clock (16 bytes/clock) transfers to and from the
FPU, where the work is done.

-------------------------

Finally:

Me> IBM RS/6000 Models:
Me>     320,520 = 160/24 = 6.8 MFLOPS
                           ^^^
This should of course be 6.7 MFLOPS.
--
John D. McCalpin                        mccalpin@perelandra.cms.udel.edu
Assistant Professor                     mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.      J.MCCALPIN/OMNET
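
P.S. For anyone who wants to check the arithmetic, here is a minimal
sketch of the streaming formulas above. The function and table names
are illustrative, not part of any benchmark. The 12 bytes/FLOP figure
for DAXPY is not stated above; it is my reading of "same memory
traffic, one extra FP op" (24 bytes over 2 FLOPs). The 160 MB/s input
is the figure used above for the RS/6000 320 and 520.

```python
# Bytes moved per floating-point operation, assuming 64-bit operands.
BYTES_PER_FLOP = {
    "dyad": 24.0,   # a(i) = b(i) op c(i): 2 reads + 1 write, 1 FLOP
    "dscal": 16.0,  # x(i) = scalar*x(i): 1 read + 1 write, 1 FLOP
    "daxpy": 12.0,  # y(i) = y(i) + scalar*x(i): 24 bytes, 2 FLOPs (derived)
}


def streaming_mflops(mbytes_per_sec, kernel="dyad"):
    """Estimate bandwidth-limited MFLOPS for one of the kernels above."""
    return mbytes_per_sec / BYTES_PER_FLOP[kernel]


if __name__ == "__main__":
    # 160 MB/s is the cache-to-register figure quoted for the 320/520.
    for kernel in ("dyad", "dscal", "daxpy"):
        print(f"{kernel:5s} {streaming_mflops(160.0, kernel):5.1f} MFLOPS")
```

With 160 MB/s the dyad case comes to 160/24 = 6.67, which rounds to
the corrected 6.7 MFLOPS rather than 6.8.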