Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!unmvax!gatech!gitpyr!loligo!mccalpin
From: mccalpin@loligo (John McCalpin)
Newsgroups: comp.lang.fortran
Subject: Re: Are vendors implementing BLAS?
Summary: BLAS may still be helpful
Message-ID: <7626@pyr.gatech.EDU>
Date: 17 Mar 89 17:02:34 GMT
References: <449@orange19.qtp.ufl.edu>
Sender: news@pyr.gatech.EDU
Reply-To: mccalpin@loligo.cc.fsu.edu (John McCalpin)
Distribution: na
Organization: Supercomputer Computations Research Institute
Lines: 41

In response to the following:

>In article <449@orange19.qtp.ufl.edu>, bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
>> Is anyone out there aware of other vendors implementing the BLAS for
>> their machines?

David Bernholdt bernhold@qtp.ufl.edu

someone from MIPS made the comment that the compilers should be
generating near-optimal code anyway, and said that the MIPS LINPACK
performance was within 10% of optimal for the compiler-generated code.

This did not seem to agree with my recollection, so here are the
published LINPACK results from the January 29 LINPACK summary:

Machine             Test     Fortran/compiler   Coded/compiler   %Speedup
------------------------------------------------------------------------
M-2000   25.0 MHz   64-bit   3.6 (????)         4.0 (????)         11%
M-120/5  16.7 MHz     "      2.1 (1.30)         2.2 (1.31)          5%
M-1000   15.0 MHz     "      1.5 (1.30)         1.6 (1.21)          7%
M-800    12.5 MHz     "      1.2 (1.30)         1.1 (1.10)         -9% ***
------------------------------------------------------------------------
M-2000              32-bit   5.7 (????)         7.2 (????)         26%
M-120/5               "      3.9 (1.31)         4.8 (1.31)         23%
M-1000                "      3.6 (1.30)         4.3 (1.21)         19%
M-800                 "      3.0 (1.30)         2.4 (1.10)        -18% ***
------------------------------------------------------------------------
*** In these cases, the coded results used an old (1.10) compiler, and
    so are not competitive.

The results are within 10% for the 64-bit results, but the 32-bit code
clearly benefits from hand-optimization.
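For anyone who wants to check the %Speedup column, here is a minimal sketch in Python. The post does not say how that column was derived, so the formula below (gain of the coded run relative to the all-Fortran run) is an assumption, and the names RESULTS and speedup_percent are made up for illustration. The inputs are the rounded MFLOPS figures from the table. With this formula the M-2000, M-120/5 and M-1000 rows round to the posted percentages, while the two M-800 rows come out near -8% and -20% rather than -9% and -18%, so the published column may have been computed from unrounded rates or from a different baseline.

```python
# Rounded MFLOPS figures copied from the table above:
# (machine, precision in bits, all-Fortran rate, coded-BLAS rate)
RESULTS = [
    ("M-2000", 64, 3.6, 4.0),
    ("M-120/5", 64, 2.1, 2.2),
    ("M-1000", 64, 1.5, 1.6),
    ("M-800", 64, 1.2, 1.1),
    ("M-2000", 32, 5.7, 7.2),
    ("M-120/5", 32, 3.9, 4.8),
    ("M-1000", 32, 3.6, 4.3),
    ("M-800", 32, 3.0, 2.4),
]


def speedup_percent(fortran_mflops, coded_mflops):
    """Percent gain of the coded run over the all-Fortran run.

    Assumed definition: the source table does not state its formula.
    """
    return 100.0 * (coded_mflops - fortran_mflops) / fortran_mflops


if __name__ == "__main__":
    for machine, bits, fortran, coded in RESULTS:
        pct = speedup_percent(fortran, coded)
        print(f"{machine:8s} {bits}-bit  {fortran:4.1f} -> {coded:4.1f}  {pct:+6.1f}%")
```

A negative value means the coded run was slower than the compiled Fortran, as in the two rows marked *** in the table.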

On a Silicon Graphics Personal IRIS (which should have the same CPU and
clock as the M-800 and which uses level 1.31 of the compiler), I have
not been able to exceed 1.96 MFLOPS for the 32-bit all-Fortran code,
using full (-O3) optimization and a variety of loop unrolling lengths
(1-32). I can't yet account for this discrepancy --- anyone want to
volunteer to explain it?

----------------------   John D. McCalpin   ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu           mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------