Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ncar!ames!sgi!archer@elysium.SGI.COM
From: archer@elysium.SGI.COM (Archer Sully)
Newsgroups: comp.sys.sgi
Subject: Re: (none)
Message-ID: <32276@sgi.SGI.COM>
Date: 8 May 89 16:47:27 GMT
References: <890505182901.2cc14a47@SCRI1.SCRI.FSU.EDU>
Sender: daemon@sgi.SGI.COM
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 48

In article <890505182901.2cc14a47@SCRI1.SCRI.FSU.EDU>, MCCALPIN@SCRI1.SCRI.FSU.EDU writes:
> We have a Power Series 120 machine here for a demo/loan, and
> I have had trouble getting anything useful out of the Power
> Fortran preprocessor. The code is basically a whole bunch of double
> DO loops, with an iteration count of about 40 on the outer loop and
> 100 on the inner loops.
>
> I ran the code with the following command:
>     f77 -O2 test.f -o test
>     time test
> and it took 16.3 seconds
>
> I re-ran it with:
>     f77 -pfa keep -O2 test.f -o test
> and it took 26.4 seconds!
>
> I ran it again with:
>     f77 -pfa keep -WK,-O=4,-UR=4 -O2 test.f -o test
> and it got back down to 16.2 seconds.
>
> This code seems ideal for loop-splitting parallelization, and the
> intermediate code files show DOACROSS directives on all the important
> loops.
>
> Anybody have any ideas of something I might be doing wrong?
>

One thing that comes to mind is that you might have some initialization
loops, etc., that are parallelized but don't have enough work in each
chunk to justify the synchronization overhead.

If you haven't already, profile the single-processor version (using
both pc-sampling [-p] and pixie) and compare the results to the
intermediate files generated by pfa. Look for loops that are being
parallelized even though they use 1% (or less) of the program's
execution time.

The next trick is to remove the unwanted DOACROSS directives from the
.m file, rename it as a .f, and recompile like so:

    f77 -mp -nocpp foo.f -O2 -o foo

to generate a new parallelized executable.
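
To put the synchronization-overhead point in rough numbers, here is a small back-of-the-envelope model. It is only a sketch: it assumes a loop splits evenly across processors and that each parallelized loop pays one fixed synchronization cost. The function names and every number in it (the two loop sizes, the overhead, the processor count) are made up for illustration and are not measurements from pfa or from any real machine.

```python
def parallel_time(serial_time, n_procs, sync_overhead):
    """Estimated time for one loop split evenly across n_procs,
    plus a fixed per-loop synchronization cost (assumed model)."""
    return serial_time / n_procs + sync_overhead


def worth_parallelizing(serial_time, n_procs, sync_overhead):
    """True when the modeled parallel time beats the serial time."""
    return parallel_time(serial_time, n_procs, sync_overhead) < serial_time


def break_even(n_procs, sync_overhead):
    """Serial loop time below which parallelizing loses in this model.
    Solves serial_time = serial_time / n_procs + sync_overhead."""
    return sync_overhead * n_procs / (n_procs - 1)


if __name__ == "__main__":
    # Hypothetical values in arbitrary time units.
    procs = 2          # assumed processor count
    overhead = 50.0    # assumed fixed cost per parallel loop
    for name, serial in (("main loop", 1000.0), ("init loop", 10.0)):
        par = parallel_time(serial, procs, overhead)
        verdict = "keep" if worth_parallelizing(serial, procs, overhead) else "drop"
        print(f"{name}: serial {serial}, parallel {par}, {verdict} the DOACROSS")
    print("break-even serial time:", break_even(procs, overhead))
```

In this model a small initialization loop gets slower when parallelized, because the fixed cost outweighs the work saved, and those are the loops whose directives you would strip from the intermediate file.
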
Hope this helps,

Archer Sully
archer@sgi.com

"life is short, and full of stuff"
        -- Lux Interior --