Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!noose.ecn.purdue.edu!ee.ecn.purdue.edu!hankd
From: hankd@ee.ecn.purdue.edu (Hank Dietz)
Newsgroups: comp.arch
Subject: Superlinear Speedup (was Re: Scalability?)
Summary: Yes, it can happen -- if you change the program
Message-ID: <1990May3.203405.23456@ecn.purdue.edu>
Date: 3 May 90 20:34:05 GMT
References: <2075@naucse.UUCP> <6897@odin.corp.sgi.com> <49622@lanl.gov> <1990May1.154558.24009@cs.rochester.edu>
Sender: news@ecn.purdue.edu (USENET news)
Organization: Purdue University Engineering Computer Network
Lines: 64

In article <1990May1.154558.24009@cs.rochester.edu> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>In article <49622@lanl.gov> ryg@lanl.gov (Richard S Grandy) writes:
>>I've got a glossy from SGI that shows the POWER center performance on 
>>LINPACK (100x100, coded) as:
>>		1 CPU	3.8 DP MFLOPS
>>		4 CPU	16  DP MFLOPS
>>		8 CPU	28  DP MFLOPS
>>Does this mean than with 4 cpus you really get GREATER than a linear speedup?? 
>
>A common trap is measuring parallel speedup is to use a "loaded" CPU.  For
>instance, the single processor might have been serving interrupts, clock
>icons, etc.  The result is that the performance for the single processor case
>is artificially low.  After taking care, the numbers usually turn out slightly
>less than linear.  The second trap is measuring with a relatively coarse
>clock.  For speedup that is near linear, barely missing a clock tick can lead
>to measurements that indicate superlinear speedup.
			       ^^^^^^^^^^^

Not really.  Superlinear means better than O(n) for n processors...  hence,
even though the case of n=1 might be a bit strange, even k*n| k>1 isn't
superlinear.  For example, HJ Siegel has pointed out that for appropriate
pipelined SIMD machines (read: PASM) the overlap of the CU and PEs can
achieve an apparent speedup of 2*n...  but that is O(n), i.e., linear.

I do research in optimizing/parallelizing compilers and the first benchmarks
I did all yielded >n speedup.  Why?  BECAUSE WHEN A PROGRAM IS PARALLELIZED,
IT ISN'T THE SAME PROGRAM -- it simply produces the same results.  In other
words, it can be transformed to take advantage of various side benefits of
parallel hardware & execution:

[1]	Increased register, cache, and page frame space:  if 1 processor can
	keep K things in registers, N processors might keep K*N in registers.
	The program is different because it doesn't spill registers, etc.

[2]	Different management of cache and page frames:  for example, using
	the LRU policy, having N separate K-element LRU queues can yield
	better performance than one N*K-element queue.  We tend to forget
	that even if the same space is available, a management policy change
	can dramatically alter performance.  Proof left to the reader.  ;-)

[3]	Introduction of constants and other code simplifications:  for
	example, with N processors a loop involving a[i] might not need the
	usual indexing operation, because for each processor i is a
	compile-time constant.  This is rather like unraveling every loop,
	but unraveling loops can have negative impacts on the memory
	reference pattern which might not surface relative to a parallel
	machine.  This also happens in non-looping code.

[4]	In general, parallelization transformations restructure code so that
	locality is enhanced, but only in true parallel execution.  This is
	due to the fact that live ranges overlapping in time on a parallel
	machine does not imply that they interfere.  This is also hard to
	explain in detail, so I guess it must be intuitively obvious.  ;-)
	It has to do primarily with the effects of asynchrony....

etc.

Note that, for example, the benfits due to, for example, [1] and [3] do not
interfere -- hence their benefits can actually multiply.  In summary, you
can get greater than O(n) speedup -- true superlinear speedup -- because of
such combined effects.  My claim is, however, that you really are not
comparing equivalent programs.

						-hankd@ecn.purdue.edu