Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!mips!zaphod.mps.ohio-state.edu!wuarchive!udel!brahms.udel.edu!mccalpin
From: mccalpin@brahms.udel.edu (John D McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: A question about flat out Snake speed.
Summary: improvements to my previous estimates
Message-ID: <20459@brahms.udel.edu>
Date: 15 Apr 91 19:53:53 GMT
References: <7834@idunno.Princeton.EDU> <45780003@hpcupt3.cup.hp.com>
Organization: College of Marine Studies, U. Del.
Lines: 64

In article I wrote:
>The memory bandwidth of the machine limits the long-vector DAXPY
>performance to under 22 MFLOPS.  How well does the Fortran-produced
>code perform for long vectors?  (My estimate would be in the range of
>18-20 MFLOPS).

I realized that this number is way too high.  Looking at reasonable
values for the cache miss latency leads me to estimate the long DAXPY
performance of the 66 MHz machines as about 9 MFLOPS, or under 1/2 of
the performance based on the "peak" memory bandwidth.

>For comparison, the IBM RS/6000 machines run uncached DAXPY at:

I have revised these numbers taking into account the type of memory
access pattern and cache miss pattern exhibited by long DAXPYs.
These new results are shown below:

Model      Measured   Estimated   Theory    Measured/
            MFLOPS     MFLOPS     MFLOPS     Theory
------------------------------------------------------------------
320          6.25                  6.67      93.7%
320H         ----       7.81       8.33
530         10.53                 11.11      94.8%
540          ----      12.64      13.33
550          ----      17.26      18.22
------------------------------------------------------------------
HP 720       ???        6.67
HP 750       ???        8.80
------------------------------------------------------------------

The "sustained" MFLOPS of the machine is now modelled by:

                                       N words
    MFLOPS = (peak bandwidth) * ----------------------------- / (12 words/op)
                                (N cycles + 8 cycles latency)

where N=8 for the 320, 320H, 520, and N=16 for the 530, 540, 550.
Note that this means that the first three machines can only sustain
1/2 of the advertised "peak" memory bandwidth, while the last three
can sustain 2/3 of the "peak" memory bandwidth.

For the HP machines, I can only guess, but "reasonable" guesses give
an estimate like:

                                       N words
    MFLOPS = (peak bandwidth) * ------------------------------ / (12 words/op)
                                (N cycles + 12 cycles latency)

I have HP glossies that tell me that the cache line size (N) on the
720 is 8 words.  I estimate a larger latency because the HP does not
have the "critical-word-first" cache refill hardware that IBM does.
So if the memory latency is the same for the first word, then the HP
will wait (on the average) another 4 cycles before it gets the one it
was waiting for.  I have no info on the cache line size for the 750.

Anybody from HP want to correct my errors on the latencies and provide
some numbers for long DAXPY operations to see how well the compiler
manages?
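
For anyone who wants to plug in their own numbers, the model above can
be sketched in a few lines of Python.  This is only an illustration of
the arithmetic: the function names are made up for this example, the
peak bandwidth is left as an input because the post does not list it
per machine, and the divisor of 12 is taken directly from the formula.

```python
def sustained_fraction(line_words, latency_cycles):
    """Fraction of "peak" bandwidth sustained when every line misses.

    Models a cache line of `line_words` words that takes `line_words`
    cycles to transfer plus `latency_cycles` cycles of miss latency.
    """
    return line_words / (line_words + latency_cycles)


def sustained_mflops(peak_bandwidth, line_words, latency_cycles, per_op=12.0):
    """Sustained long-vector DAXPY rate under the model in the post.

    `peak_bandwidth` must be in units consistent with `per_op`
    (the 12-per-operation divisor from the formula); the caller
    supplies it, since it is not tabulated in the post.
    """
    return peak_bandwidth * sustained_fraction(line_words, latency_cycles) / per_op


if __name__ == "__main__":
    # Line sizes and latencies as given in the post; the HP latency
    # of 12 cycles is the author's guess, not a measured value.
    cases = [
        ("IBM, N=8,  latency 8", 8, 8),
        ("IBM, N=16, latency 8", 16, 8),
        ("HP 720, N=8, latency 12 (guess)", 8, 12),
    ]
    for label, n, latency in cases:
        print(f"{label}: {sustained_fraction(n, latency):.3f} of peak")
```

The first two cases give the 1/2 and 2/3 fractions quoted above, and
the HP guess gives 2/5.  As a consistency check, a peak figure of 160
(back-computed from the table, not stated anywhere in the post) with
N=8 and 8 cycles of latency gives about 6.67, matching the Theory
column for the 320.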