Path: utzoo!attcan!uunet!ncrlnk!ncrcae!ece-csc!mcnc!xanth!nic.MR.NET!tank!uxc!uxc.cso.uiuc.edu!uxg.cso.uiuc.edu!uxe.cso.uiuc.edu!hirchert
From: hirchert@uxe.cso.uiuc.edu
Newsgroups: comp.lang.fortran
Subject: Re: FORTRAN 88
Message-ID: <50500089@uxe.cso.uiuc.edu>
Date: 4 Nov 88 17:42:00 GMT
References: <669@convex.UUCP>
Lines: 126
Nf-ID: #R:convex.UUCP:669:uxe.cso.uiuc.edu:50500089:000:6904
Nf-From: uxe.cso.uiuc.edu!hirchert    Nov  4 11:42:00 1988


Bill Leonard(bill@hcx2.SSD.HARRIS.COM) writes about the cost of using the
Fortran 8x array notation on a scalar processor.  First, I would like to pick
a couple of nits:

>As for the array language, let me offer a concrete example.  Consider
>the following F77 fragment:

>      S = 0
>      DO 10 I = 1,N
>         DO 20 J = 1,M
>            B(J,I) = C(I,J)
>   20    CONTINUE
>         K = I + 100
>         S = S + A(K)
>   10 CONTINUE

>                                                    Now consider the
>equivalent F8x code:

>B(1:M,1:N) = TRANSPOSE(C(1:N,1:M))
>S = SUM(A(100:N+100))
           ^^^  not that it matters, but this should be 101.

>(Clearly, in this example the loop overhead must have been significant,
>or the programmer wouldn't have written the F77 code as he did.)

On a number of scalar machines I have worked on, the separate loop version might
actually be faster (due to locality of reference, instruction caches, and other
such effects).  Even on machines where the combined loop version is faster, the
difference is rarely significant in absolute terms, so that its significance
in relative terms is typically important if this computation in embedded in yet
another loop so that a significant part of the overall program is spend doing
this operation.  My experience suggests that while there are a few programmers
that would evaluate these kind of issues, most do not.  The most likely
reasons for the programmer to have combined the loops are
. that the effective sizes of A, B, and C are related and the programmer wishes
  to insure that the size computed on is the same
. that the programmer has been told it is more efficient to combine loops and
  does so without verifying that this is true for his program on his machine
. that it takes one less DO statement and one less CONTINUE statement to
  write the combined loop
The last reason (compact notation) may well be the most likely.

>This example, while real enough, is not nearly as complicated as what one
>typically encounters in real applications.

If you mean that the most complicated statement in a typical application would
be more complicated you may be right.  If you mean that the typical statement
in a typical application would be more complicated, I disagree; most
applications consist primarily of a large number of very simple statements.

>                                            Just as most of the analysis
>for vectorization involves making sure that the vectorized code gets the
>same answer, the same is true for the "scalarization" of array-language
>code.

Here I am in total agreement.  In fact, this will be true of register-based
vector processors (such as those from Cray Research or Alliant) as well.  The
conversion between parallel notation and efficient sequential notation involves
reordering of operations the must be analyzed for validity.  For the most part,
it doesn't matter which direction you are converting or how far, the basic
analysis is the same.

As I see it, array notation offers three benefits:
. It is a more compact notation (and thus, for some people, a more convenient
  notation) for expressing many common operations on arrays.
. Its basic semantics are more highly parallel (at the cost of higher
  temporary storage usage).  In many cases, analysis can be done to determine
  the equivalence of this parallel notation and the traditional sequential
  notation.  In cases where this analysis either cannot be done or where it is
  not done because of cost, the parallel notation will produce better
  performance on machines which support parallel or overlapped evaluation.
  (Originally, all the parallel machine were top of the line machines.  Today,
  there are a number of "middle tier" machine with these properties and we are
  beginning to see add-on boards with these properties for machines in the
  workstation class.  In other words, parallel and overlapped execution issues
  are affecting an ever increasing fraction of the Fortran community.)
. The array notation puts under processor control a number of aspects of
  array computation where getting good performance is processor dependent.
  For example, when doing array operations on a two-dimensional array, there
  are a number of factors the go into the decision of which loop is the inner
  loop, including locality of reference, memory bank conflicts, and relative
  length of the two loops.  The relative importance of these factors varies
  from processor to processor, so on one machine it may be best one way while
  on another machine it may be better the other way.  Thus, a sequential
  statement of the operation will be optimal for only one of those machines.
  The parallel notation lets the processor choose the inner loop and thus be
  optimal one both machines.

It also has costs:
. It adds to the complexity of expression analysis, etc.
. In cases where the parallel-sequential equivalency analysis can not or is not
  done, the parallel notation will produce worse performance on purely
  sequential machines.  (Not necessarily significantly worse, but definitely
  worse)
. In the added factors under the processor's control, if the analysis is done
  incorrectly (or not done), the wrong decision may be made, producing worse
  code than what the programmer could have written using the sequential
  notation.

On some machines, the benefits outweigh the costs.  On others, the costs
outweigh the benefits.  From the number of vendors already implementing the
array notation, it is clear that the array notation is likely to be available
on machines where the benefits outweigh the costs.  Given that, I ask the
following questions:
. Do we want to standardize the notation available on such machines (to improve
  portability among such machines)?
. Is it likely that people working on machines where the costs outweigh the
  benefits will nevertheless wish to import programs orginally written on
  machines where the array notation is available?

Finally, I might note that even if the array notation is adopted as part of the
Fortran standard, it would be possible to leave it out of your compiler if you
provide a separate tool that converts the parallel notation to a sequential
notation accepted by the compiler.  (In this case, the "standard-conforming
processor" would be the combination of the compiler and the tool.)  On some
machines this might be a beneficial implementation strategy:
. When importing programs written using the array notation, the cost of
  analyzing that notation to sequentialize could be paid just once.
. When writing codes on the machine, the user could decide whether or not to
  use the array notation by deciding for himself whether the notational
  benefits justified the cost of the extra analysis on each translation cycle.

Kurt W. Hirchert     hirchert@ncsa.uiuc.edu
National Center for Supercomputing Applications