Path: utzoo!utgpu!watserv1!watmath!att!linac!pacific.mps.ohio-state.edu!zaphod.mps.ohio-state.edu!usc!snorkelwacker.mit.edu!ai-lab!life!burley
From: burley@pogo.ai.mit.edu (Craig Burley)
Newsgroups: comp.lang.fortran
Subject: Re: strength reduction of array address calculations
Message-ID: <BURLEY.90Nov29103957@pogo.ai.mit.edu>
Date: 29 Nov 90 15:39:57 GMT
References: <BURLEY.90Nov28084920@pogo.ai.mit.edu>
	<1990Nov28.213735.25563@murdoch.acc.Virginia.EDU>
	<BURLEY.90Nov28205003@pogo.ai.mit.edu>
	<1990Nov29.041958.2254@murdoch.acc.Virginia.EDU>
Sender: news@ai.mit.edu
Organization: /home/fsf/burley/.organization
Lines: 252
In-reply-to: gl8f@astsun.astro.Virginia.EDU's message of 29 Nov 90 04:19:58 GMT

In article <1990Nov29.041958.2254@murdoch.acc.Virginia.EDU> gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:

   In article <BURLEY.90Nov28205003@pogo.ai.mit.edu> burley@pogo.ai.mit.edu (Craig Burley) writes:

   > I was responding (somewhat tongue in cheek, though hoping I'd
   >actually get an incredibly insightful answer) to someone who had said one can
   >strength-reduce the multiplies away in loops in response to yet another person
   >who clearly had been talking about, yes, RANDOM references.

   I was not claiming that I could strength-reduce random references. I
   was claiming that most references were in fact not random.

   >So, yes, I know (or am pretty sure) you can't.  But I asked the
   >question anyway, because I wanted to know the poster's answer, or at
   >least suggest without being overly and publically rude that perhaps he
   >shouldn't have responded to a post about random references to arrays
   >saying they can be strength-reduced.

   Well, I think you've managed to be publically rude ;-)

I see the smiley, but I do want to clarify: in no way did I consider
Greg's posting, to which I originally responded, silly or anything.
It just seemed to be a case of losing track of how the discussion
progressed, as most of us do (ok, I know I do it).  I did get a little
annoyed at the private and public responses to my post talking about
non-random references when clearly the particular subtopic had become
random references as a subclass of array references in general: "You know,
mammals are ..." (big discussion); "Well, that's usually the case, but if
you have to deal with horses, they are an exception, because they are...";
"Yes, but most mammals aren't horses, so it really isn't important!"  HUH?
Sigh.  We need an object-oriented newsgroup system (inheritance in
particular) because it is really hard to keep track of all these threads
of discussion, especially as topics are refined into subtopics without it
being clear in postings about subtopics that they are intentionally being
discussed as subtopics rather than supertopics.  Who wants to trace back
the discussion of every thread before replying?  Oh well, dream on.

   >A common example of random accesses are scatter/gather operations.  The array
   >element access computation in the following function in C:
   >
   >    void foo(a,b,c)
   >    float a[100][100];
   >    int b[100];
   >    int c[100];
   >    {
   >    int i, j;
   >
   >    for (i = 0; i < 100; ++i)
   >      for (j = 0; j < 100; ++j)
   >        a[b[i]][c[j]] = 3.1415*(i+j);
   >    }
   
   This can be strength-reduced, although I would be hard pressed to see
   how the compiler could do it automagaically. If, in Fortran, you do
   the gather with each column getting one outer loop iteration, strength
   reduction is possible. It'd look something like this:
   
       DO I = 1, NUMCOLS
          ICOL = B(I)
          DO J = 1, NUMROWS(I)
             A( C(I,J), ICOL ) = 3.1415*(whatever)
          ENDDO
       ENDDO
   
   Since ICOL is constant across the inner loop, it can be strength
   reduced. I hope I haven't switched rows and columns, I can never
   remember which one's which by name although I know which index varies
   fastest :-)

Hmm, first off, I should have specified b[(i+j)/2] in my sample program just
to make things sufficiently "random" (and done the equivalent for the
Fortran procedure).

Next, I should have pointed out that while references to <a> were random,
the references to <b> and <c> weren't, so this particular example (some kind
of scatter/gather thing, I don't write this kind of code so I'm no
expert on what it's used for) still contains array references that are
within the scope of discussion of C's pointer accesses vs. Fortran's
array accesses (though as has been discussed, with one-dimensional arrays
and optimizing compilers, there ain't much overall difference across all
computing systems used today).

Third, I think Greg's new version of the Fortran has "C(I,J)" where it
should have "C(J)" -- but I may just be confused.

Fourth, I break out in cold sweats whenever I think about writing anything
down in Fortran vs. other languages regarding multi-dimensional arrays
due to column-major vs. row-major ordering.  But when it actually
comes down to it, all I have to do is think about it in my favorite way
(which I think is called row-major), and then reverse the dimensions when
writing Fortran, and it works.  I hope.  Greg's example, aside from the
third nit above, looks fine to me.

Finally, this whole thread has been fascinating and highly informative for
two reasons (yes, here's another numbered list -- that's what you get for
reading the ramblings of a one-time tech writer):

One, I'm saving some of the juicy stuff for review when GNU Fortran is
working to serve as optimizations to check for and perhaps implement
as time becomes available.

Two, it makes me think about how both C and Fortran have fundamental
omissions/comissions regarding arrays.  C doesn't let you work with
multi-dimensional arrays directly, you must express them using arrays
of arrays (which is a different means of expression, even though it may or
may not be more efficient for a specific problem) or by writing your
own subscripting expressions for a single monolithic array (which allows
the auto-reshaping-across-procedure-boundary often called for by Fortran
programmers).  Fortran doesn't let you say that when you are expecting
an argument A to a procedure and say DIMENSION A(10,10), you really know
that the array passed will itself be dimensioned that way, and that when
passing an array to a procedure, it'll expect any optimization info on its
known shape to be passed; the result being that, unless inter-procedural
optimization is done, arrays must be passed as flat and must be received
as flat, and pointer maps to speed up multi-dimensional array accesses
can be used only entirely within the given procedure.

I don't know how easy or reasonable it would be to add multi-dimensional
arrays to C (I can see the newsgroup discussions now: "C is just as good
as Fortran for scientific computing"; "No it isn't, C doesn't have
true multi-dimensional arrays"; "Well, you can accomplish it using other
means"; "That can be inefficient"; "Well, it could be added to C fairly
easily"; "Ok, let's add it"; "Hey!  Why are you making my favorite
language bigger?  I don't need this feature!"; "Sure you do, it's great
for scientific computing"; "But I don't use C for scientific computing!").

But with Fortran there could be a means for declaring the interface to a
procedure as requiring an array of a particular shape, and the definition
of that procedure would codify that expectation, so all callers of the
procedure would not only know that they had to pass an array of that shape,
but that they could also pass any pointer maps that might be useful to
access elements of the array optimally on a given architecture, and the
procedure could assume the existence of those pointer maps.  Gee, maybe
that's what some of that stuff in the Fortran 90 spec was trying to do; I'll
have to look at it again to see if it makes more sense to me in the light
of this idea for a feature. (-:

Ultimately I always believe the better language is one in which you can
state at the highest level what you want to accomplish.  C's requirement
for either arrays of arrays or hand-coded subscript calculation (and I
don't consider macros or inline functions to affect this line of
reasoning) when you really want to simply specify a multi-dimensional array
means it is not as good as Fortran in this respect.  Fortran's self-
imposed requirement that it be able to reshape arrays across procedure
calls in all cases, by virtue of not allowing you to say "I know this
array is indeed this shape in both the called procedure and in all its
calling procedures", means it is not as good as C if you use C's array of
arrays approach (where it is a given that the actual and formal arrays are
the same shape, sort of).

When you are able to say what you mean in a language, then it is better
for maintainability, retargetability (in terms of other languages),
portability (other machines), and optimization, or at least as good.
An optimizing compiler might well be able to optimize multi-dimensional
array accesses coded that way, but not coded as low-level expressions a
la the approach one would use in C when representing a conceptual
multi-dimensional array as a single-dimensioned array in the code.

Some compilers might be able to optimize the low-level expressions as well
by recognizing patterns.  These don't have a real hard time with the
higher-level constructs; for compilation speed, they can optimize them
directly, or for internal elegance and verifiability they can "lower" them
to the equivalent low-level expressions and let the powerful optimization
at that level take care of the job.

An example of this kind of lowering and letting powerful low-level
optimization to good things is taking:

      DO 10 I=1,100
      ...
10    CONTINUE

and turning it into:

      I = 1                              (code-visible iterator)
      TEMP1 = (100-I+1)/1                (# of trips)
      TEMP2 = 1                          (increment value for iterator)
T1    IF (TEMP1.LE.0) GOTO T2            (any trips left?)
      ...
10    CONTINUE
      I = I + TEMP2                      (bump iterator for program)
      TEMP1 = TEMP1 - 1                  (countdown trip)
      GO TO T1                           (next iteration)
T2    ...

Then a general flow analyzer detects that T1:T2 is a loop and determines
whether it fits into one of various models of loops with corresponding
optimized implementations.  This way, not only is the original DO loop
shown optimized via this lowering, but code written like the lowered version
also is transformed.  Do people really write code like the lowered version?
Apparently, when you do things like dead-code elimination, flow reordering,
and other things, yes, you can end up with things that are really loops
that didn't look that way at first.  Useless-variable elimination via
def-use chains, along with array element access elimination, not only
eliminates some or all of the temporary variables created (either by
replacing their values with machine instruction equivalents, like replacing
I = I + TEMP2 with the machine's instruction for C's ++I or by at
least overlaying the temp with other temps in separate loops) but
may eliminate, for example, I as well (if the array references involving
I can be strength-reduced into, say, pointers incremented each time the
loop iterates, as one would do in C).  And some machines loop faster when
the loop ends on a value reaching 0 than reaching some other number, which
is why TEMP1 is set up to do that by calculating the number of trips.
(FYI: It is not valid standard Fortran for the program to change the value
of a DO loop iterator while its DO loop is still active, so the above
transformation safely ignores the possibility of a change to I during the
loop.  This is another reason Fortran code can sometimes be faster than
"equivalent" C code; and, of course, if the programmer isn't aware of the
restriction, violates it by modifying I during the loop and expecting the
number of trips to be adjusted accordingly, and doesn't get a compile-time
message, the result is buggy code that will work on some systems and not
others -- hey, I'm gonna go add just that kind of check to GNU Fortran right
now!)

The same can be done for array references.  Change:

A(I,J)

to:

A[(I-lbound(A,1))+((J-lbound(A,2))*extent(A,1))]

where lbound(x,y) is the lower bound of array x in dimension y, hbound
(not mentioned above) is the same for the upper bound, and extent(x,y)
is hbound(x,y)-lbound(x,y)+1.

Then, things like strength reduction and loop induction are still doable,
but might well apply to more cases than if restricted to explicit multi-
dimensional array accesses.

It is possible for an optimizer to strength-reduce too much, i.e. to the
point where the code ends up less optimal than it might otherwise have been,
on a machine where resource and instruction scheduling is important.  It
might be that the resulting intermediate code requires too many operations
of the adder unit and yet leaves the multiplier unit twiddling its thumbs
much of the time.  To keep this from happening, one could use a backtracking
mechanism directed to guess at a best goal for some code and repeatedly
try various optimizations and schedule the results until that goal is
achieved, and keep track of the best achievement in the meantime if it
isn't -- this'll handle the problem by not keeping the completely
strength-reduced version as the best achieved.  But I lean towards going
ahead and doing the strength-reduction, then providing information for
the instruction and resource scheduler so that it can selectively move
candidate operations off, in this case, a busy adder unit and onto an
idle multiplier unit, and automatically generate any support code that
kind of adjustment requires (and make sure things don't get worse in
the process), because it's the scheduler that knows just where such
movement is needed, not the code that decides whether to strength-reduce
or "induct" an expression.  But the whole proposition is RISCy... :-)
--

James Craig Burley, Software Craftsperson    burley@ai.mit.edu