Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!wuarchive!zaphod.mps.ohio-state.edu!sdd.hp.com!elroy.jpl.nasa.gov!ames!sgi!shinobu!odin!pipo.corp.sgi.com!jpp
From: jpp@pipo.corp.sgi.com (Jean-Pierre Panziera)
Newsgroups: comp.sys.sgi
Subject: Re: Basic Linear Algebra Subroutines (BLAS)
Message-ID: <10365@odin.corp.sgi.com>
Date: 14 Jul 90 00:05:56 GMT
References: <90Jul13.100737edt.8304@ephemeral.ai.toronto.edu>
Sender: news@odin.corp.sgi.com
Reply-To: jpp@corp.sgi.com
Organization: Silicon Graphics, Applications Product Division
Lines: 33

In article <90Jul13.100737edt.8304@ephemeral.ai.toronto.edu>,
tff@na.toronto.edu (Tom Fairgrieve) writes:
> From: tff@na.toronto.edu (Tom Fairgrieve)
> Subject: Basic Linear Algebra Subroutines (BLAS)
> Date: 13 Jul 90 14:08:02 GMT
> Organization: Department of Computer Science, University of Toronto
> 
> Does SGI have an optimized version of the BLAS (Basic Linear Algebra 
> Subroutines) available for the 4d/240?  If so, how does the performance
> of this version compare to a version produced by the f77 compiler with
> -O3 optimization level set?  I'm interested in all 3 levels of the BLAS.
> 
> Thanks for any information,
>   Tom Fairgrieve
>   tff@na.utoronto.ca


As far as I know, SGI does not have an official version of BLAS3,
but I may be wrong.

However, I have optimized and parallelized a Fortran version of
the matrix multiplication routines of BLAS3.

I get pretty good results on a 220-GTX:

dgemm 5-11 Mflops
zgemm 10-14 Mflops
sgemm 10-16 Mflops
cgemm 12-17 Mflops

The lowest figures are for A * trans(B), the highest for trans(A) * B.
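
For anyone who wants to compare their own timings with the figures above,
here is a rough sketch in Python. It is not the Fortran code mentioned in
this post. It assumes the usual convention of counting 2*m*n*k operations
for a real matrix multiply and 8*m*n*k for a complex one; the post does not
say which count was used. The names gemm_flops, mflops and matmul are
made up for illustration, and matmul is a plain reference triple loop
covering the A * trans(B) and trans(A) * B cases, not an optimized routine.

```python
import time


def gemm_flops(m, n, k, complex_data=False):
    """Nominal operation count for C = op(A) * op(B), op(A) m-by-k, op(B) k-by-n.

    Assumed convention: 2 operations per inner-loop step for real data
    (one multiply, one add) and 8 for complex data.
    """
    per_step = 8 if complex_data else 2
    return per_step * m * n * k


def mflops(flop_count, seconds):
    """Convert an operation count and an elapsed time into Mflops."""
    return flop_count / seconds / 1.0e6


def matmul(a, b, trans_a=False, trans_b=False):
    """Reference triple loop on lists of rows; op(X) is X or its transpose."""
    m = len(a[0]) if trans_a else len(a)
    k = len(a) if trans_a else len(a[0])
    n = len(b) if trans_b else len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                x = a[p][i] if trans_a else a[i][p]
                y = b[j][p] if trans_b else b[p][j]
                s += x * y
            c[i][j] = s
    return c


if __name__ == "__main__":
    size = 60
    a = [[float(i + j) for j in range(size)] for i in range(size)]
    b = [[float(i - j) for j in range(size)] for i in range(size)]
    for ta, tb, label in [(False, True, "A * trans(B)"),
                          (True, False, "trans(A) * B")]:
        start = time.perf_counter()
        matmul(a, b, trans_a=ta, trans_b=tb)
        elapsed = time.perf_counter() - start
        rate = mflops(gemm_flops(size, size, size), elapsed)
        print(f"{label}: {rate:.2f} Mflops")
```

The rates this prints say nothing about the 220-GTX; the point is only the
bookkeeping, so that numbers from different machines are counted the same way.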

I am sure it can be improved, and I do not guarantee that it is bug-free.