Path: utzoo!attcan!uunet!brunix!sgf
From: sgf@cs.brown.edu (Sam Fulcomer)
Newsgroups: comp.sys.sgi
Subject: Re:  FFT's on 4D/2XX systems
Message-ID: <29265@brunix.UUCP>
Date: 14 Feb 90 16:56:04 GMT
References: <9002122052.AA21651@snow-white.merit-tech.com>
Sender: news@brunix.UUCP
Reply-To: sgf@cs.brown.edu (Sam Fulcomer)
Distribution: na
Organization: Brown University Department of Computer Science
Lines: 38

In article <9002122052.AA21651@snow-white.merit-tech.com> goss@SNOW-WHITE.MERIT-TECH.COM (Mike Goss) writes:
>In reply to the message from Tom Reed:
>> I'm looking for any FFT software that is available and runs on the
>> 4D/2XX products. The faster the better especially if it is parallel code or
>
>The book "Numerical Recipes in C" (also available in FORTRAN and Pascal
>versions) has several good FFT routines, although not in a parallelized form.

Well, _Numerical_Recipes_ is OK. I haven't tried to parallelize the f77 codes
yet (I haven't poked at them much), though it might be worthwhile. It's quite
possible that PFA won't like them much; many numerical packages (IMSL in
particular) aren't very adaptable to parallel architectures.

Another problem with all current numerical packages (NAG is working on it, as
others may be) is that they are not optimized for big-memory problems on
cache machines (i.e., as matrix size goes up, data cache hit rates go down,
and so does performance). Algorithms that process the data in blocks of
nearby addresses are the solution to this problem (although monster data
caches are another).
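
To make the blocking idea concrete, here is a minimal sketch in C of a plain
matrix multiply next to a blocked (tiled) one. It is only an illustration,
not code from any of the packages mentioned: the function names and the block
size parameter bs are made up for the example, and a sensible bs depends on
the data cache size of the machine at hand.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: plain triple loop, C = A * B.
 * All matrices are n x n, stored row-major in flat arrays. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

static size_t min_sz(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* Blocked version: computes the same product, but walks the matrices in
 * bs x bs tiles so that each tile is reused while it is still likely to be
 * in the data cache. bs is a hypothetical tuning parameter. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0);

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += bs) {
        size_t imax = min_sz(ii + bs, n);
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t kmax = min_sz(kk + bs, n);
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t jmax = min_sz(jj + bs, n);
                /* Multiply one tile of A by one tile of B and
                 * accumulate into the matching tile of C. */
                for (size_t i = ii; i < imax; i++) {
                    for (size_t k = kk; k < kmax; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < jmax; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both routines do the same arithmetic; the blocked one only changes the order
in which memory is touched, which is the whole point on a cache machine.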

The important thing to understand when trying to get performance out of a
multi-proc SGI is to characterize exactly what load the machine is under when
you want the performance. Parallelized code will run well (on a 4-proc
system) if it is the only (or nearly the only) thing running on the system.
If you've got 2 of the beasts running you _may_ still be getting better than
single-proc performance, but don't bet on it. Don't even bother running a
parallel job if you don't have (effectively) 2 idle processors.

I haven't bothered using PFA myself, since we typically have 2 or 3 things
going on at any given time on our 4D/240GTX (64MB), with someone running
4Sight. My experience with it has been limited to bitching at people who've
run multi-proc jobs on a busy system (and helping them PFA their code).

I am very pleased with the thing's performance on single-proc jobs, though.
On an idle system the machine will run 4 copies of the same computation in
the same wall-clock time that one copy takes. A one-processor job (heavy FPU)
seems to take about 2-3 times as much CPU time as on a 3090 with vector
processor (the program was vectorized on the 3090).
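
Since the numbers above mix wall-clock time and CPU time, here is a rough
sketch in C of how one might record both for a single run, using only the
standard clock() and time() calls. The workload fpu_work() and the wrapper
timed_run() are hypothetical stand-ins for illustration, not the program I
actually timed.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a heavy floating-point job:
 * sums the first iters terms of the harmonic series. */
double fpu_work(long iters)
{
    double x = 0.0;
    for (long i = 1; i <= iters; i++)
        x += 1.0 / (double)i;
    return x;
}

/* Runs the job once. Stores the CPU seconds used by this process in *cpu
 * and the elapsed wall-clock seconds in *wall. Note that time() only has
 * one-second resolution, so short runs will report a wall time of 0. */
double timed_run(long iters, double *cpu, double *wall)
{
    assert(cpu != NULL && wall != NULL);

    clock_t c0 = clock();
    time_t t0 = time(NULL);

    double result = fpu_work(iters);

    *cpu = (double)(clock() - c0) / CLOCKS_PER_SEC;
    *wall = difftime(time(NULL), t0);
    return result;
}
```

On a busy machine the wall-clock figure grows while the CPU figure stays
roughly the same, which is why the two should be reported separately.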