Xref: utzoo comp.lang.fortran:5110 comp.unix.cray:294 comp.sys.super:314 Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!cs.utexas.edu!rice!ariel.rice.edu!preston From: preston@ariel.rice.edu (Preston Briggs) Newsgroups: comp.lang.fortran,comp.unix.cray,comp.sys.super Subject: Re: Fortran optimization - THE ANSWER! Summary: hmmm Message-ID: <1991Apr5.060803.17612@rice.edu> Date: 5 Apr 91 06:08:03 GMT References: <1991Apr5.032552.12817@eagle.lerc.nasa.gov> Sender: news@rice.edu (News) Organization: Rice University, Houston Lines: 169 fsjohnv@alliant1.lerc.nasa.gov (Richard Mulac) writes: >The contradiction that I found in my >real code can be described using the following program TEST, which calls >subroutines JOINED and SPLIT. > > PROGRAM TEST > PARAMETER (IX=10000,JX=100,NX=100) > COMMON /GLOBAL/ > . A(IX,JX),B(IX,JX),C(IX,JX) > CALL JOINED > CALL SPLIT > STOP > END > SUBROUTINE JOINED > PARAMETER (IX=10000,JX=100,NX=100) > COMMON /GLOBAL/ > . A(IX,JX),B(IX,JX),C(IX,JX) > DO N=1,NX > DO I=1,IX > A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) > A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) > : > A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) > A(I,100) = A(I,100) * B(I,100) + C(I,100) > A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) > A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) > : > A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) > A(I,100) = A(I,100) * B(I,100) + C(I,100) > ENDDO > ENDDO > RETURN > END > SUBROUTINE SPLIT > PARAMETER (IX=10000,JX=100,NX=100) > COMMON /GLOBAL/ > . A(IX,JX),B(IX,JX),C(IX,JX) > DO N=1,NX > DO I=1,IX > A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) > A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) > : > A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) > A(I,100) = A(I,100) * B(I,100) + C(I,100) > ENDDO > DO I=1,IX > A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) > A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) > : > A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) > A(I,100) = A(I,100) * B(I,100) + C(I,100) > ENDDO > ENDDO > RETURN > END > > The code was compiled with the -o aggress option which vectorized the >inner loops in both JOINED and SPLIT. It was expected that JOINED would run >faster and more efficiently than SPLIT, but the opposite result occurred. >Using the hardware performance monitor (hpm) I found that JOINED ran in 2.8 >seconds at a rate of 140 Mflops, while SPLIT ran in 2.2 seconds at a rate >of 180 Mflops. The results from this test program were similiar to those >I found when combining 10 to 50 line loops with complicated right hand sides >in the original code I was working on. > Several people analyzed the program and found the problem to be due to the >large amount of address locations which are computed and saved for later use >within each loop (i.e., JOINED). Mulac concludes that the code is poorly written and there's is no compiler bug. I agree that the code is poorly written, but I don't believe the cause is addressing expressions. Since A, B, and C all have the same dimensions, we should be able to handle all the addressing in the inner loops with a single index register and constant offsets. All the addressing expressions should be of the form I*8 + A + 0*100 I*8 + B + 0*100 I*8 + C + 0*100 I*8 + A + 1*100 I*8 + B + 1*100 I*8 + C + 1*100 ... ... ... Since I is always multiplied by 8, it should be strength reduced. The rest of each addressing expression is constant. So, index + constant. If the code is indeed suffering from too many addressing expressions, then Cray should look to their strength reduction and consider reassociation of loop-invariants. There are alternative explanations. You're addressing A, B, and C with the 2nd index varying (running from 1 to 100). On a cache machine, this is a bad move. I dunno about Crays. Certainly you're causing non-unit stride accesses. Further, if the compiler detects that A(I, 1) is unchanged across the 99 intervening assignments, it may be trying to hold 100 FP values in registers, optimistically awaiting their reuse. I would propose an alternative implementation DO N=1, NX DO J = 1, JX DO I=1, IX A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) ENDDO ENDDO ENDDO We would hope the compiler will achieve the code below DO N=1, NX DO J = 1, JX DO I=1, IX aij = A(I, J) bij = B(I, J) cij = C(I, J) aij = aij * bij + cij aij = aij * bij + cij A(I, J) = aij ENDDO ENDDO ENDDO If you must unroll, unroll the outermost loop, giving DO N=1, NX, 4 DO J = 1, JX DO I=1, IX A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) ENDDO ENDDO ENDDO which the compiler will probably turn into DO N=1, NX, 4 DO J = 1, JX DO I=1, IX aij = A(I, J) bij = B(I, J) cij = C(I, J) aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij A(I, J) = aij ENDDO ENDDO ENDDO Be aware though, that unrolling the outer loop and jamming the bodies isn't always legal or the right thing to do. Preston Briggs