Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!swrinde!elroy.jpl.nasa.gov!decwrl!sgi!tarolli@westcoast.esd.sgi.com
From: tarolli@westcoast.esd.sgi.com (Gary Tarolli)
Newsgroups: comp.sys.sgi
Subject: Re: SGI GL matrix performance
Summary: software matrix mult perf.
Message-ID: <100182@sgi.sgi.com>
Date: 29 Apr 91 16:16:03 GMT
References: <15407@helios.TAMU.EDU>
Sender: guest@sgi.sgi.com
Distribution: usa
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 50

In article <15407@helios.TAMU.EDU>, jamie@archone.tamu.edu (James Price) writes:
> Has anyone done any benchmarking of the SGI matrix functions? I was curious
> and wrote the program included below. It does a number of 4x4 matrix
> multiplies, first using software, and then using the geometry pipeline
> functions (loadmatrix(), multmatrix(), getmatrix()).
>
> Here are some typical results:
>
>   10000 iterations on fritz, with GL version: GL4DGT-3.3
>     Software - no optimization:    3.349 sec.
>     Software - some optimization:  1.130 sec.
>     Software - more optimization:  0.910 sec.
>     Hardware - preserve CTM:       2.379 sec.
>     Hardware - destroy CTM:        2.289 sec.
>     Hardware - abandon results:    0.580 sec.
>
> The actual hardware multiplication is fast (0.580 sec/10000 multiplies)
> but if we call getmatrix() to access the results, it slows things down
> by around 400% (to 2.379 sec/10000 multiplies). I was hoping to use the
> speed of the hardware for my own matrix needs, but it looks like the
> getmatrix() call is simply too costly. Is there a better way?

It's possible to do a complete 4x4 matrix multiply in under 310 cycles on
a MIPS processor (in single precision). At 33 MHz this works out to over
100,000 matrix multiplies per second, or about 0.10 sec for your benchmark
above, more than 5 times faster than the hardware!
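
As a rough sketch of the arithmetic above: the function names mat4_mul and
benchmark_seconds are made up for illustration, and this is plain portable C,
not the tuned code that was measured at 310 cycles. It only shows the shape of
a single-precision 4x4 multiply and how a cycle count turns into a benchmark
time.

```c
/* Hypothetical helper (not from the original post): c = a * b for 4x4
   matrices, with every operand and the accumulator kept in float. */
static void mat4_mul(float a[4][4], float b[4][4], float c[4][4])
{
    int i, j, k;

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (k = 0; k < 4; k++)
                sum += a[i][k] * b[k][j];   /* 64 multiplies over the whole product */
            c[i][j] = sum;
        }
    }
}

/* Hypothetical helper: seconds needed for `iterations` matrix multiplies,
   given an assumed cost in cycles per multiply and a clock rate in Hz. */
static double benchmark_seconds(double cycles_per_multiply, double clock_hz,
                                long iterations)
{
    return (double)iterations * cycles_per_multiply / clock_hz;
}
```

With these, benchmark_seconds(310.0, 33.0e6, 10000) comes out a little under
0.1 sec, against the 0.580 sec reported for the hardware path without
getmatrix().
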
I think one of the reasons your software benchmark ran so slowly is that
you may have forgotten to compile with -float (so all floating-point math
was done in double precision).

The theoretical limit for a matrix multiply would be 64*4 = 256 cycles plus
a few. Of course, reaching it requires writing very careful assembler code
in order to overlap all the adds and loads/stores with the 4-cycle
multiplies. So I suspect that you could improve upon the 310 cycles I
actually measured by about 10%.

--------------------
	Gary Tarolli