Path: utzoo!attcan!uunet!brunix!sgf
From: sgf@cs.brown.edu (Sam Fulcomer)
Newsgroups: comp.sys.sgi
Subject: Re: FFT's on 4D/2XX systems
Message-ID: <29265@brunix.UUCP>
Date: 14 Feb 90 16:56:04 GMT
References: <9002122052.AA21651@snow-white.merit-tech.com>
Sender: news@brunix.UUCP
Reply-To: sgf@cs.brown.edu (Sam Fulcomer)
Distribution: na
Organization: Brown University Department of Computer Science
Lines: 38

In article <9002122052.AA21651@snow-white.merit-tech.com> goss@SNOW-WHITE.MERIT-TECH.COM (Mike Goss) writes:
>In reply to the message from Tom Reed:
>> I'm looking for any FFT software that is available and runs on the
>> 4D/2XX products. The faster the better especially if it is parallel code or
>
>The book "Numerical Recipes in C" (also available in FORTRAN and Pascal
>versions) has several good FFT routines, although not in a parallelized form.

Well, _Numerical_Recipes_ is ok. I haven't bothered to try to parallelize
the f77 codes yet, but it might be worthwhile (I haven't poked at them
much). It's quite possible that PFA won't like them much. Many numerical
packages (IMSL in particular) aren't very adaptable to parallel
architectures.

Another problem with all current numerical packages (although NAG is
working on it, as may be others) is that they are not optimized for
big-memory problems on cache machines (i.e., as matrix size goes up, data
cache hits go down, as does performance). Algorithms that process
address-regions of data in blocks are the solution to this problem
(although monster data caches are another).

The important thing to understand when trying to get performance out of a
multi-processor SGI is to characterize exactly what load the machine is
seeing at the time you want the performance. Parallelized code will run
well (on a 4-processor system) if it is the only (or nearly the only) thing
running on the system. If you've got 2 of the beasts running you _may_
still be getting better than single-processor performance, but don't bet on it.
Don't even bother running if you don't have (effectively) 2 idle processors.

I haven't bothered using the PFA, since we typically have 2 or 3 things
going on at any given time on our 4D/240GTX (64MB), with someone running
4Sight. My experience with it has been limited to bitching at people who've
run multi-processor jobs on a busy system (and helping them PFA their code).

I am very pleased with the thing's performance on single-processor jobs,
though. On an idle system the machine will run 4 copies of the same
computation in the same wall-clock time that one copy takes. A
one-processor job (heavy FPU) seems to take about 2-3 times as much CPU
time as it does on a 3090 with the vector processor (the program
vectorized on the 3090).
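
To make the point about blocked algorithms concrete, here is a minimal
sketch in C (C99) of a matrix multiply that works through the data in
square tiles, next to the plain triple loop it replaces. It is only an
illustration of the technique, not code from any of the packages mentioned
above. The function names and the block-size parameter bs are made up for
this example, and the right value of bs depends on the data cache of the
machine at hand, so treat it as something to tune rather than a known
number.

```c
#include <assert.h>
#include <stddef.h>

/* Plain triple loop: c = a * b for n x n row-major matrices.
 * For large n the column walk through b misses the data cache often. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Hypothetical helper: smaller of two sizes, used to clip edge tiles. */
static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* Blocked version: same result, but the work is done one bs x bs tile
 * at a time so each tile of a, b and c is reused while it is still in
 * cache. bs is an assumed tuning parameter, not a measured value. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0);

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = min_size(ii + bs, n);
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = min_size(kk + bs, n);
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = min_size(jj + bs, n);
                /* Multiply one tile of a by one tile of b and
                 * accumulate into the matching tile of c. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both routines take the same arguments apart from bs, so the blocked one
can be checked against the naive one on a small matrix before timing it on
a large one. Whether it actually wins, and by how much, has to be measured
on the machine in question.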