From: preston@ariel.rice.edu (Preston Briggs)
Newsgroups: comp.lang.fortran,comp.unix.cray,comp.sys.super
Subject: Re: Fortran optimization - THE ANSWER!
Message-ID: <1991Apr5.062536.17948@rice.edu>
Date: 5 Apr 91 06:25:36 GMT
References: <1991Apr5.032552.12817@eagle.lerc.nasa.gov> <1991Apr5.060803.17612@rice.edu>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
Lines: 44

I wrote

>If you must unroll, unroll the outermost loop, giving
>
>	DO N=1, NX, 4
>	  DO J = 1, JX
>	    DO I=1, IX
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	    ENDDO
>	  ENDDO
>	ENDDO

On further thought (!), I'd unroll the middle loop a little
(use moderation in your experiments).  Something like

	DO N=1, NX
	  DO J = 1, JX, 4
	    DO I=1, IX
	      A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0)
	      A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0)
	      A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1)
	      A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1)
	      A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2)
	      A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2)
	      A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3)
	      A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3)
	    ENDDO
	  ENDDO
	ENDDO

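the idea being that the compiler would be better able to schedule
this stuff.  Instead of 1 dependence chain, the inner loop body now
has 4 independent ones (one per column, J+0 through J+3) that can be
run in parallel, hopefully filling the pipelines.  Note that this
assumes JX is a multiple of 4; otherwise a cleanup loop is needed
for the leftover columns.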
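
The unrolled nest above steps J by 4, so as written it only covers every
column when JX is a multiple of 4.  A minimal sketch of one way to handle
the general case follows, assuming free-form Fortran 90.  The program name,
the array sizes, the fill values, the reference array R, and the helper
variable JMAIN are all made up for illustration and are not from the
original code.

```fortran
! Sketch only: sizes, values, R and JMAIN are illustrative assumptions.
PROGRAM UNROLL4
  IMPLICIT NONE
  ! JX is deliberately not a multiple of 4 so the cleanup loop runs.
  INTEGER, PARAMETER :: NX = 3, JX = 10, IX = 8
  REAL :: A(IX, JX), B(IX, JX), C(IX, JX), R(IX, JX)
  INTEGER :: N, J, I, JMAIN

  A = 2.0
  R = 2.0
  B = 0.5
  C = 1.0

  ! Reference result: the nest without any unrolling.
  DO N = 1, NX
    DO J = 1, JX
      DO I = 1, IX
        R(I, J) = R(I, J) * B(I, J) + C(I, J)
        R(I, J) = R(I, J) * B(I, J) + C(I, J)
      ENDDO
    ENDDO
  ENDDO

  ! Largest multiple of 4 that does not exceed JX.
  JMAIN = JX - MOD(JX, 4)

  DO N = 1, NX
    ! Main part: middle loop unrolled by 4, as in the post.
    DO J = 1, JMAIN, 4
      DO I = 1, IX
        A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0)
        A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0)
        A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1)
        A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1)
        A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2)
        A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2)
        A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3)
        A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3)
      ENDDO
    ENDDO
    ! Cleanup: the 0 to 3 columns the unrolled loop did not reach.
    DO J = JMAIN + 1, JX
      DO I = 1, IX
        A(I, J) = A(I, J) * B(I, J) + C(I, J)
        A(I, J) = A(I, J) * B(I, J) + C(I, J)
      ENDDO
    ENDDO
  ENDDO

  ! Both versions do the same operations per element, so this
  ! should be zero or within rounding.
  PRINT *, 'max difference:', MAXVAL(ABS(A - R))
END PROGRAM UNROLL4
```

The cleanup loop sits inside the N loop so that every column gets the same
number of updates per outer iteration as in the original nest.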

Experiment a little with the amount of unrolling and see what happens.
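
One way to run such an experiment is to time each variant with the
Fortran 90 SYSTEM_CLOCK intrinsic.  The harness below is only a sketch:
the program name, the sizes, and the fill values are assumptions chosen
for illustration (JX is a multiple of 4 so no cleanup loop is needed),
and the numbers it prints will depend entirely on the compiler, its
options, and the machine.

```fortran
! Sketch only: sizes and values are illustrative assumptions.
PROGRAM TIMEUNROLL
  IMPLICIT NONE
  INTEGER, PARAMETER :: NX = 100, JX = 512, IX = 512
  REAL :: A(IX, JX), B(IX, JX), C(IX, JX)
  INTEGER :: N, J, I, T0, T1, RATE

  B = 0.999
  C = 0.001

  ! Variant 1: no unrolling.
  A = 1.0
  CALL SYSTEM_CLOCK(T0, RATE)
  DO N = 1, NX
    DO J = 1, JX
      DO I = 1, IX
        A(I, J) = A(I, J) * B(I, J) + C(I, J)
        A(I, J) = A(I, J) * B(I, J) + C(I, J)
      ENDDO
    ENDDO
  ENDDO
  CALL SYSTEM_CLOCK(T1)
  ! Printing the checksum keeps the compiler from discarding the loops.
  ! Clock wraparound (COUNT_MAX) is ignored in this sketch.
  PRINT *, 'rolled:     ', REAL(T1 - T0) / REAL(RATE), ' s, sum ', SUM(A)

  ! Variant 2: middle loop unrolled by 4.
  A = 1.0
  CALL SYSTEM_CLOCK(T0)
  DO N = 1, NX
    DO J = 1, JX, 4
      DO I = 1, IX
        A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0)
        A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0)
        A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1)
        A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1)
        A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2)
        A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2)
        A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3)
        A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3)
      ENDDO
    ENDDO
  ENDDO
  CALL SYSTEM_CLOCK(T1)
  PRINT *, 'unrolled x4:', REAL(T1 - T0) / REAL(RATE), ' s, sum ', SUM(A)
END PROGRAM TIMEUNROLL
```

To try other unroll amounts, copy the second variant and change the J
step and the number of J+k statement pairs together.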

Preston Briggs