Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!ncar!midway!quads.uchicago.edu!rtp1
From: rtp1@quads.uchicago.edu (raymond thomas pierrehumbert)
Newsgroups: comp.sys.apollo
Subject: Re: Snakebytes (long -- and poisonous?)
Message-ID: <1991Mar27.220716.17323@midway.uchicago.edu>
Date: 27 Mar 91 22:07:16 GMT
References: <9103272114.AA31661@richter.mit.edu>
Sender: rtp1@midway.uchicago.edu (raymond thomas pierrehumbert)
Organization: University of Chicago
Lines: 20

On cache-dominated machines, the LINPACK benchmarks (100x100 or 300x300)
are not very good tests of realistic performance with any current
compilers I am aware of. This is because many large problems, including
2D FFTs and large matrix multiplies, can be rewritten using "strip
mining" to maximize cache hits and reuse of data. On an IBM R6000/730,
you get only 2 megaflops for a 1000x1000 compiled matrix multiply, but
with a simple modification of the loop to maximize cache hits, you get
about 40 megaflops (and this is on a 25 MHz machine). Compilers aren't
yet smart enough to do strip mining.

By the way, for the IBM 6000 series, I have been able to learn this
sort of stuff about how to get performance out of the thing. On my
DN10000, I have to use the vector library and BLAS library to get any
performance; with the machine-coded matrix multiply, I get about 25
megaflops on 1 processor. With Fortran, the results go down to about
2-3 megaflops. I have no idea how to get performance from inside
Fortran, despite having had the machine for almost two years now.
You'd almost think HP/A considered performance tuning techniques a
closely guarded secret!
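
For readers who have not seen strip mining applied to a matrix multiply, here is a rough sketch of the kind of loop restructuring meant above. It is not the actual modification used on the IBM machine, which the post does not show. The function names, the row-major layout, and the `block` parameter are all assumptions made for illustration, and it is written in C rather than Fortran. The idea is to work on small tiles so that each tile is reused many times while it is still in cache.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two sizes; used to clip tiles at the matrix edge. */
static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Plain triple loop: c = a * b for n x n row-major matrices.
 * The inner loop walks down a column of b with stride n, so for
 * large n almost every access to b can miss the cache. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Strip-mined (blocked) version of the same product.
 * `block` is a hypothetical tuning parameter: pick it so that
 * three block x block tiles of doubles fit in the data cache. */
void matmul_blocked(size_t n, size_t block,
                    const double *a, const double *b, double *c)
{
    assert(block > 0);

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += block) {
        size_t i_end = min_size(ii + block, n);
        for (size_t kk = 0; kk < n; kk += block) {
            size_t k_end = min_size(kk + block, n);
            for (size_t jj = 0; jj < n; jj += block) {
                size_t j_end = min_size(jj + block, n);
                /* Multiply one tile of a by one tile of b and
                 * accumulate into the matching tile of c. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both routines compute the same product; only the order of the memory accesses differs. The best value of `block` depends on the cache size and organization of the particular machine, so it has to be found by experiment.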