Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!elroy.jpl.nasa.gov!decwrl!sgi!shinobu!odin!patton.wpd.sgi.com!jmb
From: jmb@patton.wpd.sgi.com (Jim Barton)
Newsgroups: comp.sys.sgi
Subject: Re: SGI GL matrix performance
Message-ID: <1991May2.010302.9633@odin.corp.sgi.com>
Date: 2 May 91 01:03:02 GMT
References: <15407@helios.TAMU.EDU> <1290@voodoo.UUCP>
Sender: news@odin.corp.sgi.com (Net News)
Distribution: usa
Organization: Silicon Graphics Inc.
Lines: 38

In all cases you must run the benchmark several times and average the
results to get a true performance number. The reasons are many and
varied, but some of the more significant ones are:

1) When you first run a program, it takes a while to fill up the
   processor cache. Depending on context switching, etc., the cache
   can be more or less effective at various times during the run.

2) When you first execute a program, IRIX must read it from disk.
   However, IRIX is fanatical about caching disk blocks in memory, and
   it is quite likely that the second execution just picks up the
   pages already in memory, so execution time could be significantly
   shorter. This happens even when the timing is built into the
   program, since executables are almost always demand paged.

3) The way in which real memory pages are allocated to the process has
   a big impact on performance because the processor caches are direct
   mapped. For example, on a system with a 64 KB cache, real memory
   addresses that are equal modulo 64 KB map to the same cache
   location. IRIX tries its best to allocate physical memory in a
   linear fashion, so that the probability of cache thrashing is
   minimized, but in the final analysis the application's memory
   access pattern will determine the performance.

4) The 4D/20 and 4D/25 have a 1-deep write buffer. By default, C does
   all floating point in double precision (two words). Thus, when the
   compiled code writes out a double precision float, the first word
   is buffered, but the second stalls the processor until the first
   write has been retired. Using single precision floats (the -float
   flag to the compiler) will eliminate this problem, unless you
   really need double precision. The POWERSeries machines have a
   4-deep write buffer, while the 4D/35 has an 8-deep write buffer.

Benchmarking is Art, not Science. I suspect it always will be, despite
the best efforts of SPEC, etc.

-- Jim Barton
Silicon Graphics Computer Systems
jmb@sgi.com
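
As a rough illustration of the averaging advice above, here is a minimal sketch in C. It is not code from the original post: bench_average, mat_job, mat_job_init and mat_job_run are hypothetical names, and the 4x4 single precision matrix multiply is only a stand-in workload. The warm-up runs are left untimed so that page-ins and cache fills (points 1 and 2) happen before measurement starts.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical harness, not from the original post. */
typedef void (*bench_fn)(void *arg);

/* Call `work` `warmup` times untimed, then `runs` times timed, and
 * return the mean CPU seconds per timed run (0.0 if runs <= 0). */
double bench_average(bench_fn work, void *arg, int warmup, int runs)
{
    double total = 0.0;
    int i;

    if (runs <= 0)
        return 0.0;

    /* Untimed runs: let demand paging and cache fills settle first. */
    for (i = 0; i < warmup; i++)
        work(arg);

    for (i = 0; i < runs; i++) {
        clock_t start = clock();
        work(arg);
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    return total / runs;
}

/* Stand-in workload: c = a * b for 4x4 single precision matrices. */
struct mat_job {
    float a[4][4], b[4][4], c[4][4];
    long calls;                      /* number of times the job has run */
};

/* Set a to the identity, b to distinct values, c to zero. */
void mat_job_init(struct mat_job *job)
{
    int i, j;

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            job->a[i][j] = (i == j) ? 1.0f : 0.0f;
            job->b[i][j] = (float)(i * 4 + j) + 0.5f;
            job->c[i][j] = 0.0f;
        }
    }
    job->calls = 0;
}

/* bench_fn-compatible wrapper around the matrix multiply. */
void mat_job_run(void *arg)
{
    struct mat_job *job = (struct mat_job *)arg;
    int i, j, k;

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (k = 0; k < 4; k++)
                sum += job->a[i][k] * job->b[k][j];
            job->c[i][j] = sum;
        }
    }
    job->calls++;
}
```

A caller would initialize a struct mat_job with mat_job_init and then call, for example, bench_average(mat_job_run, &job, 2, 5). Note that clock() reports processor time, so it does not directly show time spent waiting on disk, and a single 4x4 multiply is far too short to resolve with it. A real benchmark would loop the workload many times inside each timed run.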