Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!elroy.jpl.nasa.gov!decwrl!sgi!shinobu!odin!patton.wpd.sgi.com!jmb
From: jmb@patton.wpd.sgi.com (Jim Barton)
Newsgroups: comp.sys.sgi
Subject: Re: SGI GL matrix performance
Message-ID: <1991May2.010302.9633@odin.corp.sgi.com>
Date: 2 May 91 01:03:02 GMT
References: <15407@helios.TAMU.EDU> <1290@voodoo.UUCP>
Sender: news@odin.corp.sgi.com (Net News)
Distribution: usa
Organization: Silicon Graphics Inc.
Lines: 38

In all cases you must run the benchmark several times and average the results
to get a true performance number. The reasons are many and varied, but some of
the more significant ones are:

   1) When you first run a program, it takes a while to fill up the processor
      cache. Depending on context switching, etc., the cache can be more or
      less effective at various times during the run.

   2) When you first execute a program, IRIX must read it from disk. However,
      IRIX is fanatical about caching disk blocks in memory, and it is quite
      likely that the second execution just picks up the pages in memory, and
      execution time could be significantly faster. This happens even when the
      timing is built into the program, since executables are almost always
      demand paged.

   3) The way in which real memory pages are allocated to the process has a big
      impact on performance because the processor caches are direct mapped.
      For example, on a system with a 64 KB cache, real memory addresses that
      are equal modulo 64 KB map to the same cache location. IRIX tries its
      best to allocate physical memory in a linear fashion, so that the
      probability of cache thrashing is minimized, but in the final analysis
      the application's memory access pattern will determine the performance.

   4) The 4D/20 and 4D/25 have a 1-deep write buffer. By default, C does all
      floating point in double precision (two words). Thus, when the program
      writes out a double precision float, the first word is buffered, but
      the second stalls the processor until the first write has been retired.
      Using single precision floats (the -float flag to the compiler) will
      eliminate this problem, provided you do not really need double
      precision. The POWERSeries machines have a 4-deep write buffer, while
      the 4D/35 has an 8-deep write buffer.
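
To make point 3 concrete, here is a minimal sketch in C of how a direct-mapped
cache picks a slot for a real memory address. The function names are made up
for illustration, and the 16-byte line size used in the example below is an
assumption, not a figure for any particular SGI machine.

```c
#include <assert.h>
#include <stdint.h>

/* Slot that a real (physical) address occupies in a direct-mapped cache.
 * cache_bytes and line_bytes are illustrative parameters. */
static uint32_t cache_slot(uint32_t addr, uint32_t cache_bytes,
                           uint32_t line_bytes)
{
    assert(line_bytes > 0 && cache_bytes >= line_bytes &&
           cache_bytes % line_bytes == 0);
    return (addr % cache_bytes) / line_bytes;
}

/* Two addresses conflict when they belong to different memory lines
 * but compete for the same cache slot. */
static int cache_conflict(uint32_t a, uint32_t b, uint32_t cache_bytes,
                          uint32_t line_bytes)
{
    return cache_slot(a, cache_bytes, line_bytes) ==
               cache_slot(b, cache_bytes, line_bytes) &&
           (a / line_bytes) != (b / line_bytes);
}
```

With a 64 KB cache and the assumed 16-byte lines, addresses 0x1000 and 0x11000
(exactly 64 KB apart) conflict, while 0x1000 and 0x2000 do not. A loop that
alternates between two conflicting addresses evicts its own data on every
reference, which is the thrashing case described above.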

Benchmarking is Art, not Science. I suspect it always will be, despite the
best efforts of SPEC, etc.
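
As a rough sketch of the "run it and average" advice at the top of this post,
something like the following C would do. All names here are hypothetical, and
the option to throw away a few warm-up runs before timing is my own
assumption; pass 0 for warmup to get a plain average. Note that clock()
reports CPU time, so it will not show time spent waiting on disk.

```c
#include <stddef.h>
#include <time.h>

#define MAX_RUNS 64   /* arbitrary cap on the number of timed runs */

/* Arithmetic mean of n samples; returns 0.0 for an empty set. */
static double mean_of(const double *samples, size_t n)
{
    double sum = 0.0;
    size_t i;

    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* CPU seconds consumed by one call to work(). */
static double time_once(void (*work)(void))
{
    clock_t start = clock();

    work();
    return (double)(clock() - start) / (double)CLOCKS_PER_SEC;
}

/* Run work() warmup times untimed in effect (results discarded),
 * then average the CPU time of the next runs calls. */
static double average_seconds(void (*work)(void), int warmup, int runs)
{
    double samples[MAX_RUNS];
    int i;

    if (runs < 1)
        runs = 1;
    if (runs > MAX_RUNS)
        runs = MAX_RUNS;
    for (i = 0; i < warmup; i++)
        (void)time_once(work);
    for (i = 0; i < runs; i++)
        samples[i] = time_once(work);
    return mean_of(samples, (size_t)runs);
}

/* Stand-in workload; replace with the code being measured. */
static void sample_work(void)
{
    volatile double acc = 0.0;
    int i;

    for (i = 0; i < 100000; i++)
        acc += (double)i * 0.5;
}
```

Calling average_seconds(sample_work, 1, 10) discards one run and averages the
next ten. Comparing that figure against the very first run gives some feel for
how much the cold cache and demand paging cost.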

-- Jim Barton
   Silicon Graphics Computer Systems
   jmb@sgi.com