Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: A question about flat out Snake speed.
Message-ID:
Date: 3 Apr 91 19:27:04 GMT
References: <1991Apr2.120117.21406@nas.nasa.gov> <7834@idunno.Princeton.EDU>
Sender: usenet@ee.udel.edu
Followup-To: comp.benchmarks
Organization: College of Marine Studies, U. Del.
Lines: 61
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: ssr@stokes.Princeton.EDU's message of 3 Apr 91 14:34:49 GMT

>>>>> On 3 Apr 91 14:34:49 GMT, ssr@stokes.Princeton.EDU (Steve S. Roy) said:

Steve> With all the discussion of the speed of HP's hot new Snake
Steve> systems, I've been wondering what their peak speeds are.

Steve> Suppose you hand coded a matrix multiply, trig function, FFT or
Steve> whatever.  What is the maximum speed you could get and what
Steve> would the limiting factors be?  Is a daxpy ( x = a*x+y )
Steve> limited by the FPU or by cache or by main memory?  How flexible
Steve> is the multiply accumulate?  Can the integer and fp units run
Steve> in parallel as on the i860?

(1) It looks like the "Peak Speed" of the Snakes will be 50 MFLOPS for
the 720 and 66 MFLOPS for the 730.  This is based on the comment from
someone at HP that the adder and multiplier could each accept a new
pair of operands every other clock.  Since they are fully pipelined and
can run simultaneously, this gives a peak MFLOPS = MHz.

Note that this is different than the IBM RS/6000 whose
adder/multiplier can accept operands every clock so that add/multiply
peak MFLOPS = 2*MHz.

(2) Except for specially coded, cache-friendly stuff like Matrix
Multiply and Gaussian Elimination of Dense Matrices (i.e. LINPACK),
most calculations will be limited by the memory bandwidth.

The formula for 64-bit vector dyad operations (e.g.
DSCAL) is

    Streaming MFLOPS = (MBytes/sec)/(24 bytes/FLOP)

The HP has a 32-bit memory bus capable of 1 word/clock transfer rates,
so the "Streaming MFLOPS" for these machines is:

    HP/9000 Models:
    720             = 200/24 =  8.3 MFLOPS
    730             = 264/24 = 11.0 MFLOPS

The formula for the IBM RS/6000 is slightly different since the
bottleneck is between cache and the FPU, not between main memory and
cache.  The "Streaming MFLOPS" for these machines is:

    IBM RS/6000 Models:
    320,520         = 160/24 =  6.7 MFLOPS
    530,730,930     = 200/24 =  8.3 MFLOPS
    540             = 240/24 = 10.0 MFLOPS
    550             = 328/24 = 13.7 MFLOPS

The corresponding numbers for triadic vector operations like DAXPY are
exactly twice these estimates.

(3) How fast your code will run will depend to a great degree on how
much re-use you get of cached data.  Long streaming vector ops will
run at the memory bandwidth-limited speed, while short (cacheable),
reused vectors will run at closer to the "Peak Speed".
-- 
John D. McCalpin                        mccalpin@perelandra.cms.udel.edu
Assistant Professor                     mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.      J.MCCALPIN/OMNET
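
For anyone who wants to redo the arithmetic above, here is a minimal
Python sketch of the two estimates.  The bandwidth figures, the 24
bytes/FLOP divisor, and the "twice as fast for DAXPY" rule are taken
straight from the post; the function names, the dictionary layout, and
the 12 bytes/FLOP figure (24 halved, which is just another way of
writing "exactly twice") are my own illustrative choices, not anything
from HP or IBM.

```python
# Sketch of the peak and bandwidth-limited ("streaming") MFLOPS estimates.
# All bandwidths (MBytes/sec) are the figures quoted in the post.

BYTES_PER_FLOP_DYAD = 24.0   # divisor used in the post for dyad operations
BYTES_PER_FLOP_TRIAD = 12.0  # assumption: half of 24, i.e. "exactly twice" the rate


def streaming_mflops(mbytes_per_sec, bytes_per_flop=BYTES_PER_FLOP_DYAD):
    """Bandwidth-limited rate: (MBytes/sec) / (bytes/FLOP)."""
    return mbytes_per_sec / bytes_per_flop


def peak_mflops(clock_mhz, units=2, clocks_per_issue=2):
    """Peak rate for `units` pipelined FP units, each accepting new
    operands every `clocks_per_issue` clocks.

    Defaults model the Snake as described (adder + multiplier, one issue
    every other clock), giving MFLOPS = MHz.  With clocks_per_issue=1
    this gives the RS/6000 figure of 2*MHz.
    """
    return clock_mhz * units / clocks_per_issue


# Sustained bandwidth per model, in MBytes/sec, as quoted in the post.
BANDWIDTH = {
    "HP 9000/720": 200.0,
    "HP 9000/730": 264.0,
    "RS/6000 320,520": 160.0,
    "RS/6000 530,730,930": 200.0,
    "RS/6000 540": 240.0,
    "RS/6000 550": 328.0,
}


def table():
    """Return (model, dyad MFLOPS, triad MFLOPS) rows, rounded to 0.1."""
    rows = []
    for model, bw in BANDWIDTH.items():
        dyad = round(streaming_mflops(bw), 1)
        triad = round(streaming_mflops(bw, BYTES_PER_FLOP_TRIAD), 1)
        rows.append((model, dyad, triad))
    return rows


if __name__ == "__main__":
    for model, dyad, triad in table():
        print(f"{model:22s} dyad {dyad:5.1f}  triad {triad:5.1f} MFLOPS")
    print("Snake 720 peak:", peak_mflops(50), "MFLOPS")
    print("Snake 730 peak:", peak_mflops(66), "MFLOPS")
```

The gap between the two functions is the whole point of (3): on the 720
the peak estimate is 50 MFLOPS, while the streaming estimate for the
same machine is 200/24, so long vectors that miss cache run several
times slower than the floating-point units could go.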