Xref: utzoo comp.lang.fortran:5111 comp.unix.cray:295 comp.sys.super:315
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!cs.utexas.edu!bcm!rice!ariel.rice.edu!preston
From: preston@ariel.rice.edu (Preston Briggs)
Newsgroups: comp.lang.fortran,comp.unix.cray,comp.sys.super
Subject: Re: Fortran optimization - THE ANSWER!
Message-ID: <1991Apr5.062536.17948@rice.edu>
Date: 5 Apr 91 06:25:36 GMT
References: <1991Apr5.032552.12817@eagle.lerc.nasa.gov> <1991Apr5.060803.17612@rice.edu>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
Lines: 44

I wrote

>If you must unroll, unroll the outermost loop, giving
>
>	DO N=1, NX, 4
>	  DO J = 1, JX
>	    DO I=1, IX
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	    ENDDO
>	  ENDDO
>	ENDDO

On further thought (!), I'd unroll the middle loop a little instead
(use moderation in your experiments). Something like

	DO N=1, NX
	  DO J = 1, JX, 4
	    DO I=1, IX
	      A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0)
	      A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1)
	      A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2)
	      A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3)
	    ENDDO
	  ENDDO
	ENDDO

the idea being that the compiler would be better able to schedule
this stuff. Instead of 1 expression per inner-loop iteration, we now
get 4 independent expressions that can be run in parallel, hopefully
filling the pipelines.

Experiment a little with the amount of unrolling and see what happens.

Preston Briggs
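
One detail worth checking when experimenting: the middle-loop version
steps J by 4 and touches columns J+0 through J+3, so as written it only
stays in bounds when JX is a multiple of 4. Below is a minimal sketch of
one way to handle the general case, with a cleanup loop for the leftover
columns and a comparison against the plain triple loop. The program
name, the array sizes, the fill values, the helper variable JMAIN, and
the tolerance are all assumptions made for illustration; none of them
come from the original post.

```fortran
PROGRAM UNROLL_DEMO
  IMPLICIT NONE
  ! Sizes are arbitrary assumptions; JX is deliberately NOT a multiple of 4.
  INTEGER, PARAMETER :: IX = 64, JX = 10, NX = 3
  REAL :: A(IX, JX), B(IX, JX), C(IX, JX), REF(IX, JX)
  INTEGER :: I, J, N, JMAIN

  ! Fill the arrays with simple, reproducible values (illustrative only).
  DO J = 1, JX
    DO I = 1, IX
      A(I, J) = 1.0
      B(I, J) = 0.5
      C(I, J) = REAL(I + J)
    ENDDO
  ENDDO
  REF = A

  ! Reference result: the plain, un-unrolled triple loop.
  DO N = 1, NX
    DO J = 1, JX
      DO I = 1, IX
        REF(I, J) = REF(I, J) * B(I, J) + C(I, J)
      ENDDO
    ENDDO
  ENDDO

  ! JMAIN (hypothetical helper) is the largest multiple of 4 <= JX.
  JMAIN = JX - MOD(JX, 4)

  DO N = 1, NX
    ! Main part: middle loop unrolled by 4, four independent columns
    ! per inner-loop iteration.
    DO J = 1, JMAIN, 4
      DO I = 1, IX
        A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0)
        A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1)
        A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2)
        A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3)
      ENDDO
    ENDDO
    ! Cleanup: the remaining 0 to 3 columns, one at a time.
    DO J = JMAIN + 1, JX
      DO I = 1, IX
        A(I, J) = A(I, J) * B(I, J) + C(I, J)
      ENDDO
    ENDDO
  ENDDO

  ! Compare with a small relative tolerance (an assumed threshold).
  IF (MAXVAL(ABS(A - REF)) <= 1.0E-5 * MAXVAL(ABS(REF))) THEN
    PRINT *, 'unrolled result agrees with plain loop'
  ELSE
    PRINT *, 'MISMATCH between unrolled and plain loop'
  ENDIF
END PROGRAM UNROLL_DEMO
```

To try other unrolling amounts, change the step, the number of
statements in the main loop, and the divisor in MOD together. Whether
any given amount actually helps depends on the compiler and the machine,
so it is worth timing each variant rather than assuming.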