Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!mailrus!uflorida!gatech!hubcap!Steve From: mcvax!cs.exeter.ac.uk!sru@uunet.UU.NET (Steve Rush) Newsgroups: comp.parallel Subject: Re: Superlinear(?) speedups Message-ID: <3881@hubcap.UUCP> Date: 14 Dec 88 13:54:22 GMT Sender: fpst@hubcap.UUCP Lines: 48 Approved: parallel@hubcap.clemson.edu There has been a lot of discussion recently about superlinear speedup so I thought I would add my personal experiences eg. In article <3756@hubcap.UUCP>, kale@m.cs.uiuc.edu (L. V. Kale') writes: > Here is one more way / reason that leads to the famed superlinear speedups. The Background : I am doing work on graphics algorithms using Transputers, and one of my algorithms is for rotating images which are stored using 'stripcodes' ('objects' are defined using the coordinates of the left end, along with a colour and a length). The algorithm rotates this about a user specified point and then regenerates the new strips. To parallelize this, the screen is split into horizontal bands with one band for each processor. For efficiency, each processor has an array of linked lists, one for each scanline of the image. Results : This program took more than 4 times as long to run on one processor as it did on 4 processors (that is the definition of superlinear speedup isn't it?). The reason for this (I had to explain it to my supervisor somehow) is that the newly generated strips had to be added to the correct linked list, and merged with the strips already there. In the single processor system this required searching through the whole list, while in the 4 processor system it had a smaller list to search. The time spent at the end of the algorithm, sending the strips back to the correct processors plus the time to merge the strips came out less than the time that was saved by inserting the strips in the shorter lists. Although you can always claim that my single processor version was not efficient, the principle still holds (and I claim that the program WAS efficient for a single processor). (The only complaints I can think of are:- "a linked list is not efficient". Well, I can't think of anything else that is better, that wouldn't still show the behaviour I have described. OR "You could have the 4 linked lists per scanline on the single processor". This is cheating by running different algorithms on the two systems, in this case I should have 4 lists per scanline on every processor, which takes you back to the case above.) Has anyone got any other comments? Have I missed something obvious? I would be interrested if anyone has anything to say. (P.S. Sorry about the length of this posting). Steve R --- Steve Rush JANET: sru@uk.ac.exeter.cs Computer Science Dept. UUCP: ukc!expya!sru University of Exeter