Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!mailrus!uflorida!gatech!hubcap!Steve
From: mcvax!cs.exeter.ac.uk!sru@uunet.UU.NET (Steve Rush)
Newsgroups: comp.parallel
Subject: Re: Superlinear(?) speedups
Message-ID: <3881@hubcap.UUCP>
Date: 14 Dec 88 13:54:22 GMT
Sender: fpst@hubcap.UUCP
Lines: 48
Approved: parallel@hubcap.clemson.edu


  There has been a lot of discussion recently about superlinear speedup so
I thought I would add my personal experiences eg.
In article <3756@hubcap.UUCP>, kale@m.cs.uiuc.edu (L. V. Kale') writes:
> Here is one more way / reason that leads to the famed superlinear speedups.

The Background :
  I am doing work on graphics algorithms using Transputers, and one of my
algorithms is for rotating images which are stored using 'stripcodes'
('objects' are defined using the coordinates of the left end, along with a
colour and a length). The algorithm rotates this about a user specified
point and then regenerates the new strips.
  To parallelize this, the screen is split into horizontal bands with one
band for each processor. For efficiency, each processor has an array of
linked lists, one for each scanline of the image.

Results :
  This program took more than 4 times as long to run on one processor as it
did on 4 processors (that is the definition of superlinear speedup isn't it?). 
  The reason for this (I had to explain it to my supervisor somehow) is
that the newly generated strips had to be added to the correct linked list,
and merged with the strips already there. In the single processor system
this required searching through the whole list, while in the 4 processor
system it had a smaller list to search.
  The time spent at the end of the algorithm, sending the strips back to
the correct processors plus the time to merge the strips came out less than
the time that was saved by inserting the strips in the shorter lists.

  Although you can always claim that my single processor version was not
efficient, the principle still holds (and I claim that the program WAS
efficient for a single processor). (The only complaints I can think of are:-
"a linked list is not efficient". Well, I can't think of anything else
that is better, that wouldn't still show the behaviour I have described. 
OR "You could have the 4 linked lists per scanline on the single
processor". This is cheating by running different algorithms on the two
systems, in this case I should have 4 lists per scanline on every
processor, which takes you back to the case above.)

  Has anyone got any other comments? Have I missed something obvious? 
I would be interrested if anyone has anything to say. (P.S. Sorry about the 
length of this posting).

                                             Steve R

---
Steve Rush			JANET: sru@uk.ac.exeter.cs 
Computer Science Dept.		UUCP:  ukc!expya!sru
University of Exeter