Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!ncar!midway!quads.uchicago.edu!rtp1
From: rtp1@quads.uchicago.edu (raymond thomas pierrehumbert)
Newsgroups: comp.sys.apollo
Subject: Re:  Snakebytes (long -- and poisonous?)
Message-ID: <1991Mar27.220716.17323@midway.uchicago.edu>
Date: 27 Mar 91 22:07:16 GMT
References: <9103272114.AA31661@richter.mit.edu>
Sender: rtp1@midway.uchicago.edu (raymond thomas pierrehumbert)
Organization: University of Chicago
Lines: 20


On cache-dominated machines, the Linpack benchmarks (100x100 or 300x300)
are not very good tests of realistic performance with any current
compilers I am aware of.  This is because many large problems, including
2D FFTs and large matrix multiplies, can be rewritten using
"strip mining" to maximize cache hits and re-use of data.  On an IBM
RS/6000 730, you get only 2 megaflops for a 1000x1000 compiled matrix
multiply, but with a simple modification of the loop to maximize
cache hits, you get about 40 megaflops (and this is on a 25 MHz machine).
Compilers aren't yet smart enough to do strip mining.
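
For anyone who hasn't seen the trick, here is a rough sketch of the kind
of loop modification I mean.  It is written in C rather than Fortran, and
it is not the exact code behind the numbers above: the function names
matmul_naive and matmul_blocked and the block size parameter bs are made
up for illustration.  The idea is to work on bs x bs sub-blocks so that
the pieces of A, B and C being touched stay in cache while they are
re-used.  A good value of bs depends on the cache size and has to be
found by experiment.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward triple loop: C += A * B for n x n row-major matrices.
   For large n, the column walk through B misses the cache constantly. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = c[i * n + j];
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Strip-mined (blocked) version: same arithmetic, but the loops are cut
   into strips of length bs so each sub-block is re-used while it is
   still resident in cache.  bs is a tuning parameter (an assumption of
   this sketch), and n need not be a multiple of bs. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = (ii + bs < n) ? ii + bs : n;
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = (kk + bs < n) ? kk + bs : n;
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = (jj + bs < n) ? jj + bs : n;
                /* Multiply one block of A by one block of B. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        /* Unit-stride inner loop over rows of B and C. */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both routines accumulate into C, so zero it first if you want a plain
product.  Calling matmul_blocked(n, bs, a, b, c) with a few different
values of bs and timing each one is the easiest way to find what suits
a given machine.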

By the way, for the IBM RS/6000 series, I have at least been able to
learn this sort of thing about how to get performance out of the machine.
On my DN10000, I have to use the vector library and BLAS library to
get any performance;  with the machine-coded matrix multiply,
I get about 25 megaflops on 1 processor.  With Fortran, the results
go down to about 2-3 megaflops.  I have no idea how to get
performance from inside Fortran, despite having had the machine
for almost two years now.  You'd almost think HP/A considered
performance tuning techniques a closely guarded secret!