Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!unmvax!gatech!gitpyr!loligo!mccalpin
From: mccalpin@loligo (John McCalpin)
Newsgroups: comp.lang.fortran
Subject: Re: Are vendors implementing BLAS?
Summary: BLAS may still be helpful
Message-ID: <7626@pyr.gatech.EDU>
Date: 17 Mar 89 17:02:34 GMT
References: <449@orange19.qtp.ufl.edu>
Sender: news@pyr.gatech.EDU
Reply-To: mccalpin@loligo.cc.fsu.edu (John McCalpin)
Distribution: na
Organization: Supercomputer Computations Research Institute
Lines: 41

In response to the following:

>In article <449@orange19.qtp.ufl.edu>, bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
>> Is anyone out there aware of other vendors implementing the BLAS for
>> their machines? David Bernholdt	bernhold@qtp.ufl.edu

Someone from MIPS commented that the compilers should be generating
near-optimal code anyway, and said that MIPS LINPACK performance with
compiler-generated code was within 10% of optimal.

This did not seem to agree with my recollection, so here are the published
LINPACK results from the January 29 LINPACK summary:

Machine   Clock      Test     Fortran MFLOPS   Coded MFLOPS   %Speedup
                              (compiler)       (compiler)
------------------------------------------------------------------------
M-2000    25.0 MHz   64-bit   3.6  (????)      4.0  (????)      11%
M-120/5   16.7 MHz   64-bit   2.1  (1.30)      2.2  (1.31)       5%
M-1000    15.0 MHz   64-bit   1.5  (1.30)      1.6  (1.21)       7%
M-800     12.5 MHz   64-bit   1.2  (1.30)      1.1  (1.10)      -9% ***
------------------------------------------------------------------------
M-2000               32-bit   5.7  (????)      7.2  (????)      26%
M-120/5              32-bit   3.9  (1.31)      4.8  (1.31)      23%
M-1000               32-bit   3.6  (1.30)      4.3  (1.21)      19%
M-800                32-bit   3.0  (1.30)      2.4  (1.10)     -18% ***
------------------------------------------------------------------------
*** In these cases, the coded results used an old (1.10) compiler, and
    so are not competitive.

The 64-bit results are within about 10% (5% to 11%, leaving aside the
M-800 row with its old compiler), but the 32-bit code clearly benefits
from hand-optimization (19% to 26%).
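
For anyone who wants to check the %Speedup column, here is a small Python
sketch that recomputes it from the MFLOPS figures in the table. It assumes
the column means (coded rate / Fortran rate - 1); the function name and the
row layout are mine, not anything from the LINPACK report. Because the
published rates are rounded to two digits, the recomputed M-800 values
(about -8% and -20%) differ slightly from the -9% and -18% listed.

```python
# (machine, test, Fortran MFLOPS, coded MFLOPS), copied from the table above
ROWS = [
    ("M-2000",  "64-bit", 3.6, 4.0),
    ("M-120/5", "64-bit", 2.1, 2.2),
    ("M-1000",  "64-bit", 1.5, 1.6),
    ("M-800",   "64-bit", 1.2, 1.1),
    ("M-2000",  "32-bit", 5.7, 7.2),
    ("M-120/5", "32-bit", 3.9, 4.8),
    ("M-1000",  "32-bit", 3.6, 4.3),
    ("M-800",   "32-bit", 3.0, 2.4),
]


def speedup_percent(fortran_mflops, coded_mflops):
    """Percent gain of the coded rate over the all-Fortran rate.

    Assumed definition: 100 * (coded / fortran - 1).
    """
    return 100.0 * (coded_mflops / fortran_mflops - 1.0)


for machine, test, fortran, coded in ROWS:
    print(f"{machine:8s} {test}  {speedup_percent(fortran, coded):+6.1f}%")
```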

On a Silicon Graphics Personal IRIS (which should have the same CPU and
clock as the M-800 and which uses level 1.31 of the compiler), I have
not been able to exceed 1.96 MFLOPS for the 32-bit all-Fortran code,
using full (-O3) optimization and a variety of loop unrolling lengths
(1-32). That is well short of the 3.0 MFLOPS listed above for the M-800.
I can't yet account for this discrepancy. Does anyone want to volunteer
to explain it?
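
In case the unrolling is unfamiliar: the inner loop of the LINPACK
benchmark is an AXPY update (y = y + a*x), and "unrolling length" is the
number of element updates done per loop iteration. The sketch below shows
the transformation for a length of 4, with a clean-up loop for the
remainder. It is in Python only to show the shape of the loop; the function
names are made up for illustration, no speedup should be expected from it
in Python, and it is not the code I timed.

```python
def axpy(a, x, y):
    """Reference loop: y[i] += a * x[i], one element per iteration."""
    for i in range(len(x)):
        y[i] += a * x[i]


def axpy_unrolled4(a, x, y):
    """Same update, unrolled by 4, with a clean-up loop for n mod 4."""
    n = len(x)
    m = n % 4
    for i in range(m):            # clean-up: the first n mod 4 elements
        y[i] += a * x[i]
    for i in range(m, n, 4):      # main loop: four updates per iteration
        y[i] += a * x[i]
        y[i + 1] += a * x[i + 1]
        y[i + 2] += a * x[i + 2]
        y[i + 3] += a * x[i + 3]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y_ref = [0.5] * len(x)
y_unr = [0.5] * len(x)
axpy(2.0, x, y_ref)
axpy_unrolled4(2.0, x, y_unr)
print(y_ref == y_unr)
```

Other unrolling lengths follow the same pattern: change the stride and the
number of statements in the main loop together.
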
----------------------   John D. McCalpin   ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu		mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------