Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!swrinde!elroy.jpl.nasa.gov!decwrl!sgi!tarolli@westcoast.esd.sgi.com
From: tarolli@westcoast.esd.sgi.com (Gary Tarolli)
Newsgroups: comp.sys.sgi
Subject: Re: SGI GL matrix performance
Summary: software matrix mult perf.
Message-ID: <100182@sgi.sgi.com>
Date: 29 Apr 91 16:16:03 GMT
References: <15407@helios.TAMU.EDU>
Sender: guest@sgi.sgi.com
Distribution: usa
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 50


In article <15407@helios.TAMU.EDU>, jamie@archone.tamu.edu (James Price) writes:
> Has anyone done any benchmarking of the SGI matrix functions?  I was curious
> and wrote the program included below.  It does a number of 4x4 matrix 
> multiplies, first using software, and then using the geometry pipeline 
> functions (loadmatrix(), multmatrix(), getmatrix()).  
> 
> Here are some typical results:
> 
> 10000 iterations on fritz, with GL version: GL4DGT-3.3
> 
> Software - no optimization:     3.349 sec.
> 
> Software - some optimization:   1.130 sec.
> 
> Software - more optimization:   0.910 sec.
> 
> Hardware - preserve CTM:        2.379 sec.
> 
> Hardware - destroy CTM:         2.289 sec.
> 
> Hardware - abandon results:     0.580 sec.
> 
> 
> The actual hardware multiplication is fast (0.580 sec/10000 multiplies) 
> but if we call getmatrix() to access the results, it slows things down 
> by around 400% (to 2.379 sec/10000 multiplies).  I was hoping to use the 
> speed of the hardware for my own matrix needs, but it looks like the 
> getmatrix() call is simply too costly.  Is there a better way?


It's possible to do a complete 4x4 matrix multiply in under 310 cycles on
a MIPS processor (in single precision).  At 33 MHz this works out to over
100,000 matrix multiplies per second, or about .094 sec for your benchmark
above, more than 5 times faster than the hardware!
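
To make the arithmetic explicit, here is a small sketch in C that turns a
cycle count and a clock rate into a throughput and a benchmark time.  The
function names are made up for illustration; the only inputs taken from
this thread are 310 cycles, 33 MHz, and 10000 iterations.

```c
#include <assert.h>

/* Routines completed per second when each one costs `cycles` clock
   cycles on a CPU clocked at `clock_hz`.  (Hypothetical helper.) */
double multiplies_per_second(double cycles, double clock_hz)
{
    assert(cycles > 0.0);
    return clock_hz / cycles;
}

/* Wall-clock seconds for `iterations` back-to-back routines, ignoring
   loop overhead and cache misses (an idealizing assumption). */
double benchmark_seconds(long iterations, double cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)iterations * cycles / clock_hz;
}
```

With 310 cycles at 33e6 Hz the first function gives a bit over 106,000
multiplies per second, and the second gives roughly .094 sec for 10000
iterations, against the .580 sec measured for the hardware path.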

I think one of the reasons your software benchmark ran so slowly is
that you might have forgotten to compile with -float (in which case all
floating point math was done in double precision).
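
For reference, a plain single-precision 4x4 multiply might look like the
sketch below.  This is not the code I measured and not the poster's
benchmark; the type and function names are invented for the example, and
it assumes row-major float[4][4] storage and that the result does not
alias either input.

```c
/* 4x4 single-precision matrix, row-major.  (Hypothetical typedef.) */
typedef float Mat4[4][4];

/* c = a * b.  c must not be the same array as a or b, because c is
   written while a and b are still being read. */
void mat4_mult(Mat4 c, Mat4 a, Mat4 b)
{
    int i, j;

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            /* Inner product of row i of a and column j of b,
               unrolled: 4 multiplies and 3 adds per element. */
            c[i][j] = a[i][0] * b[0][j]
                    + a[i][1] * b[1][j]
                    + a[i][2] * b[2][j]
                    + a[i][3] * b[3][j];
        }
    }
}
```

Every operand here is a float, so whether the arithmetic is actually
carried out in single precision depends on the compiler not promoting it
to double, which is what the -float flag is for.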

The theoretical limit for matrix multiply would be 64*4 = 256 cycles + a few.
Of course, this requires writing very careful assembler code in order
to overlap all the adds and load/stores with the 4 cycle multiplies.
So I suspect that you could improve upon the 310 number I actually
measured by about 10%.
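
The bound can be written out as a short C sketch.  It assumes, as the
figures above imply, that a new multiply can start only every 4 cycles
and that all adds, loads, and stores hide underneath the multiplies.  The
helper names are hypothetical.

```c
#include <assert.h>

/* Multiplies in an n x n times n x n product: n*n results, n each. */
int mult_count(int n)
{
    return n * n * n;
}

/* Lower bound in cycles if multiplies issue back to back and
   everything else overlaps with them (idealizing assumption). */
int multiply_bound_cycles(int n, int mult_latency)
{
    assert(n > 0 && mult_latency > 0);
    return mult_count(n) * mult_latency;
}

/* Cycle count after shaving `percent` percent off `cycles`. */
int cycles_after_improvement(int cycles, int percent)
{
    return cycles - cycles * percent / 100;
}
```

For n = 4 and a 4 cycle multiply the bound is 256 cycles, and a 10%
improvement on 310 would land at 279, still a little above that bound.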


--------------------
	Gary Tarolli