Xref: utzoo comp.lang.c++:5816 comp.lang.fortran:2719
Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!uflorida!stat!stat.fsu.edu!mccalpin
From: mccalpin@masig3.ocean.fsu.edu (John D. McCalpin)
Newsgroups: comp.lang.c++,comp.lang.fortran
Subject: Re: inline and vectorization
Message-ID: <MCCALPIN.89Dec7105739@masig3.ocean.fsu.edu>
Date: 7 Dec 89 15:57:39 GMT
References: <sZTKVau00VoLMDkUw5@andrew.cmu.edu> <40827@lll-winken.LLNL.GOV>
Sender: news@stat.fsu.edu
Followup-To: comp.lang.c++
Organization: Supercomputer Computations Research Institute
Lines: 88
In-reply-to: jac@muslix.llnl.gov's message of 7 Dec 89 16:00:52 GMT

In article <40827@lll-winken.LLNL.GOV> jac@muslix.UUCP (James
Crotinger) writes:
>  However I also have other concerns, which are generic to languages
>that support vector data types (ala CFT77 and Fortran 8x). Suppose
>I have a vector type and the following code:

>   vector A, B, C, D, E
>   E = A + B*C     // meaning elementwise multiplication
>   D = A - B*C 

>This is the style of programming that the vector syntax promotes. 

This is also the style of programming that is appropriate to
memory-to-memory vector machines (Cyber 205 and ETA-10), and (more
importantly) for SIMD parallel machines like the Connection Machine.
The code above runs at the same speed on the ETA-10 (for example)
whether B*C is pre-calculated or not, since the extra multiply can be
completely overlapped with the subtract in the second line.

Of course coding for the ETA-10 is not an interesting issue for most
of us these days, but I consider it very important to maintain a
reasonable level of source code compatibility between codes for the
Connection Machine and the Cray Y/MP and for other machines which have
insufficient memory bandwidth to run vector operations in the
streaming mode shown above.  These machines include: Cray-2, Convex,
IBM 3090, most (all?) Japanese supercomputers, Ardent Titan, as well
as most machines on the drawing boards (names withheld since I am
under non-disclosure on several of these.)  Even the Cray X/MP and
Y/MP benefit from reducing the memory traffic since this minimizes the
bank conflicts suffered in a multi-processing environment.

>My question
>is, how smart will the compilers get? Will compilers evaluate the common
>subexpression (B*C) once or twice?

I don't know of *any* vectorizer/optimizer which will do this sort of
optimization on vector quantities. Anyone from Cray care to comment on
the current status of the Cray compiler on this code? 

It is *very* important that this capability be developed, since more
and more machines are going to be memory-bandwidth-deficient in the
next few years.

>With the cfront model, the B*C stuff will
>end up in separate loops and it is highly unlikely that the compiler's
>subexpression analyzer will pick it up. I think what it boils down to is
>this: will the compilers be able to do "loop jamming" on the loops that
>are implied by the vector syntax? Even in Fortran, if you coded:

>   do i = 1, n
>     E(i) = A(i) + B(i) * C(i)
>   end do
>   do i = 1, n
>     D(i) = A(i) - B(i) * C(i)
>   end do

>the optimizer would not eliminate the common subexpression. But in Fortran
>you'd never do this (well, I'd never). The loops would be "jammed" together:

>   do i = 1, n
>     E(i) = A(i) + B(i) * C(i)
>     D(i) = A(i) - B(i) * C(i)
>   end do

Converting array notation into this combined form requires
significant data dependency analysis.  I think that part of the
problem is that vectorizers have been developed as stand-alone
source-to-source translators, and the optimization aspect of this
translation has been pretty minimal to date.
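
To make the dependency point concrete, here is a minimal C++ sketch
(my own illustration, not taken from any vectorizer) of a pair of
loops that cannot simply be jammed.  The function names and the
particular statements are hypothetical.  The second statement reads
a[i+1], which the first loop has already updated in the separate
version but not yet in the jammed version, so the results differ:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-notation semantics: the first statement finishes for every i
// before the second statement starts.
std::vector<double> separate_loops(std::vector<double> a,
                                   const std::vector<double>& b) {
    assert(a.size() == b.size() + 1);  // a carries one extra element
    const std::size_t n = b.size();
    std::vector<double> d(n);
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + 1.0;
    for (std::size_t i = 0; i < n; ++i) d[i] = 2.0 * a[i + 1];
    return d;
}

// Naive jamming of the same two statements into one loop.  Here
// a[i + 1] is still the OLD value when d[i] is computed, so this is
// not equivalent to separate_loops().
std::vector<double> jammed_loops(std::vector<double> a,
                                 const std::vector<double>& b) {
    assert(a.size() == b.size() + 1);
    const std::size_t n = b.size();
    std::vector<double> d(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] + 1.0;
        d[i] = 2.0 * a[i + 1];  // reads a value not yet updated
    }
    return d;
}
```

A translator has to prove that no such cross-statement dependence
exists before it may combine the loops.  In the E/D example quoted
above the outputs never feed the inputs, so jamming is safe there.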

>Now the optimizer will optimize the heck out of this. Not only will the
>B*C product only be evaluated once, but most likely A, B, and C will
>only be fetched once. This latter point is not insignificant on a vector
>machine. Thus even if you write:

>    vector TMP = B * C
>    E = A + TMP
>    D = A - TMP

>if the underlying C compiler (or Fortran compiler) can't figure out
>how to "jam" the loops itself, this will still be less efficient
>than hand coding the loop (and it takes more memory).

yep....
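
For the C++ readers, a rough sketch of the two styles being compared.
This is an assumption about the kind of loops an overloaded-operator
vector class would expand into, not actual cfront output; the names
Vec, Result, with_temporary and jammed are made up for illustration.
The comments count vector element accesses as written in the source,
ignoring whatever registers or caches a given machine provides:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

struct Result { Vec e, d; };

// One loop per vector operation, with an explicit temporary:
// 6 loads + 3 stores per element, plus storage for tmp.
Result with_temporary(const Vec& a, const Vec& b, const Vec& c) {
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n);
    Vec tmp(n);
    Result r{Vec(n), Vec(n)};
    for (std::size_t i = 0; i < n; ++i) tmp[i] = b[i] * c[i];    // load B, C; store TMP
    for (std::size_t i = 0; i < n; ++i) r.e[i] = a[i] + tmp[i];  // load A, TMP; store E
    for (std::size_t i = 0; i < n; ++i) r.d[i] = a[i] - tmp[i];  // load A, TMP; store D
    return r;
}

// Hand-jammed loop: 3 loads + 2 stores per element, no temporary
// vector.  The product lives in a scalar for one iteration only.
Result jammed(const Vec& a, const Vec& b, const Vec& c) {
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n);
    Result r{Vec(n), Vec(n)};
    for (std::size_t i = 0; i < n; ++i) {
        const double bc = b[i] * c[i];  // computed once, reused twice
        r.e[i] = a[i] + bc;
        r.d[i] = a[i] - bc;
    }
    return r;
}
```

Both functions compute the same E and D; the difference is only in
how many times each vector crosses the memory interface.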
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
		   mccalpin@scri1.scri.fsu.edu
		   mccalpin@delocn.udel.edu