Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!noose.ecn.purdue.edu!ee.ecn.purdue.edu!hankd From: hankd@ee.ecn.purdue.edu (Hank Dietz) Newsgroups: comp.arch Subject: Superlinear Speedup (was Re: Scalability?) Summary: Yes, it can happen -- if you change the program Message-ID: <1990May3.203405.23456@ecn.purdue.edu> Date: 3 May 90 20:34:05 GMT References: <2075@naucse.UUCP> <6897@odin.corp.sgi.com> <49622@lanl.gov> <1990May1.154558.24009@cs.rochester.edu> Sender: news@ecn.purdue.edu (USENET news) Organization: Purdue University Engineering Computer Network Lines: 64 In article <1990May1.154558.24009@cs.rochester.edu> crowl@cs.rochester.edu (Lawrence Crowl) writes: >In article <49622@lanl.gov> ryg@lanl.gov (Richard S Grandy) writes: >>I've got a glossy from SGI that shows the POWER center performance on >>LINPACK (100x100, coded) as: >> 1 CPU 3.8 DP MFLOPS >> 4 CPU 16 DP MFLOPS >> 8 CPU 28 DP MFLOPS >>Does this mean than with 4 cpus you really get GREATER than a linear speedup?? > >A common trap is measuring parallel speedup is to use a "loaded" CPU. For >instance, the single processor might have been serving interrupts, clock >icons, etc. The result is that the performance for the single processor case >is artificially low. After taking care, the numbers usually turn out slightly >less than linear. The second trap is measuring with a relatively coarse >clock. For speedup that is near linear, barely missing a clock tick can lead >to measurements that indicate superlinear speedup. ^^^^^^^^^^^ Not really. Superlinear means better than O(n) for n processors... hence, even though the case of n=1 might be a bit strange, even k*n| k>1 isn't superlinear. For example, HJ Siegel has pointed out that for appropriate pipelined SIMD machines (read: PASM) the overlap of the CU and PEs can achieve an apparent speedup of 2*n... but that is O(n), i.e., linear. I do research in optimizing/parallelizing compilers and the first benchmarks I did all yielded >n speedup. Why? BECAUSE WHEN A PROGRAM IS PARALLELIZED, IT ISN'T THE SAME PROGRAM -- it simply produces the same results. In other words, it can be transformed to take advantage of various side benefits of parallel hardware & execution: [1] Increased register, cache, and page frame space: if 1 processor can keep K things in registers, N processors might keep K*N in registers. The program is different because it doesn't spill registers, etc. [2] Different management of cache and page frames: for example, using the LRU policy, having N separate K-element LRU queues can yield better performance than one N*K-element queue. We tend to forget that even if the same space is available, a management policy change can dramatically alter performance. Proof left to the reader. ;-) [3] Introduction of constants and other code simplifications: for example, with N processors a loop involving a[i] might not need the usual indexing operation, because for each processor i is a compile-time constant. This is rather like unraveling every loop, but unraveling loops can have negative impacts on the memory reference pattern which might not surface relative to a parallel machine. This also happens in non-looping code. [4] In general, parallelization transformations restructure code so that locality is enhanced, but only in true parallel execution. This is due to the fact that live ranges overlapping in time on a parallel machine does not imply that they interfere. This is also hard to explain in detail, so I guess it must be intuitively obvious. ;-) It has to do primarily with the effects of asynchrony.... etc. Note that, for example, the benfits due to, for example, [1] and [3] do not interfere -- hence their benefits can actually multiply. In summary, you can get greater than O(n) speedup -- true superlinear speedup -- because of such combined effects. My claim is, however, that you really are not comparing equivalent programs. -hankd@ecn.purdue.edu