Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!samsung!think!compass!sacilott
From: sacilott@compass.com (Roger Sacilotto)
Newsgroups: comp.lang.fortran
Subject: Re: inline and vectorization
Summary: description of COMPASS vectorization/optimization technology,
	 introduction to data optimization
Message-ID: <1370@compass.com>
Date: 15 Dec 89 16:43:42 GMT
References: <sZTKVau00VoLMDkUw5@andrew.cmu.edu> <40827@lll-winken.LLNL.GOV> <MCCALPIN.89Dec7105739@masig3.ocean.fsu.edu>
Organization: Compass, Inc., Wakefield, MA
Lines: 161

On 7 Dec 89 15:57:39 GMT,
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) said:

> In article <40827@lll-winken.LLNL.GOV> jac@muslix.UUCP (James
> Crotinger) writes:
>>  However I also have other concerns, which are generic to languages
>>that support vector data types (ala CFT77 and Fortran 8x). Suppose
>>I have a vector type and the following code:

>>   vector A, B, C, D, E
>>   E = A + B*C     // meaning elementwise multiplication
>>   D = A - B*C 

>>This is the style of programming that the vector syntax promotes. 

> This is also the style of programming that is appropriate to
> memory-to-memory vector machines (Cyber 205 and ETA-10), and (more
> importantly) for SIMD parallel machines like the Connection
> Machine.  [etc...]

>>My question
>>is, how smart will the compilers get. Will compilers evaluate the common
>>subexpression (B*C) once or twice?

> I don't know of *any* vectorizer/optimizer which will do this sort
> of optimization on vector quantities. Anyone from Cray care to
> comment on the current status of the Cray compiler on this code?

Some compilers I have contributed to here at COMPASS include
vectorizer/optimizer technology which performs these types of
optimization on both source-level vector operations as well as
vectorized scalar source code.  This technology has been applied in
various combinations to all COMPASS compilers, including the Fortran
compiler developed for the Suprenum supercomputer consortium in
Germany, the Connection Machine Fortran compiler developed for
Thinking Machines, Inc., and the MasPar Fortran compiler being
developed for MasPar Computer Corporation.


> It is *very* important that this capability be developed, since
> more and more machines are going to be memory-bandwidth-deficient
> in the next few years.

>>With the cfront model, the B*C stuff will
>>end up in separate loops and it is highly unlikely that the compilers
>>subrexpression analizer will pick it up. I think what it boils down to is
>>this: will the compilers be able to do "loop jamming" on the loops that 
>>are implied by the vector syntax. Even in Fortran, if you coded:

>>   do i = 1, n
>>     E(i) = A(i) + B(i) * C(i)
>>   end do
>>   do i = 1, n
>>     D(i) = A(i) - B(i) * C(i)
>>   end do

>>the optimizer would not eliminate the common subexpression. But in fortran
>>you'd never do this (well, I'd never). The loops would be "jammed" together:

>>   do i = 1, n
>>     E(i) = A(i) + B(i) * C(i)
>>     D(i) = A(i) - B(i) * C(i)
>>   end do

> And converting array notation into this combined form requires a
> significant data dependency analysis....  I think that part of the
> problem is that vectorizers have been developed as stand-alone
> source-to-source translators, and the optimization aspect of this
> translation has been pretty minimal to date.

Our vectorizer is integrated into the middle of our compiler, after
semantic analysis, and before strip mining, or sectioning.  This
architecture allows the vectorizer to be very knowledgable about
optimization.  In fact, our vectorizer has the opposite problem in
terms of source-to-source translation: Since knowledge of the
optimizer is embedded in the vectorizer, unparsed output can be hard
to produce in readable form.

>>Now the optimizer will optimize the heck out of this. Not only will the
>>B*C product only be evaluated once, but most likely A, B, and C will
>>only be fetched once. This latter point is not insignificant on a vector
>>machine. Thus even if you write:

>>    vector TMP = B * C
>>    E = A + TMP
>>    D = A - TMP

>>if the underlying C compiler (or fortran compiler) can't figure out
>>how to "jam" the loops itself, this will still be less efficient
>>than hand coding the loop (and it takes more memory). 

> yep....

Loop jamming (we call it loop fusion) and loop interchange are
essential to our technology.  Our compilers can fuse and interchange
user-written loops, vectorizer-generated loops and
strip-mine-generated loops when possible.

Another issue, mainly in massively-parallel SIMD architectures, is
inter-processor communication costs.  The example

	E = A + B * C
	D = A - B * C

can still be expensive if the associated elements of A, B, C, D and E
don't all "live" in the same processor.  Using the common
subexpression helps, but if the result has to be moved to three
separate locations (E, D, and A), communication costs could swamp the
costs of the computation and memory instructions.  Our approach here
is to notice that occurences of arrays may be "allocated" (distributed
among the processors) differently depending on usage.  This example
was taken from an existing program.

	DO I = 1,IMAX
	  DO J= 1,JMAX
	    TEMP(:) = A(I,J,:)
	    A(I,J,:) = B(I,J,:)
	    B(I,J,:) = TEMP(:)
	  ENDDO
	ENDDO

If allocation is done naively, then this example will result in
communication in both the def and use of TEMP, or 2*IMAX*JMAX
communications.  Analysis of usage can tell you that A and B can be
allocated identically ("aligned") in all dimensions and that TEMP can
be aligned with the third dimension of A and B.  This alignment will
eliminate the need for any communication (i.e., cost*2*IMAX*JMAX -> 0)

If multiple occurences of the same array are allocated differently,
then communication is inserted into the IR at points which minimize
the cost.  Notice the allocation of A in this example:

	A = B * C	   ! allocation of A is LOC1
			   ! insert communication here of 
			   !   A from LOC1->LOC2
	DO I = 1,N
	   D = A + F(I)	   ! allocation of A is LOC2
	   etc...
	ENDDO

If the communication is "on demand" inside the loop, then there are N
communications of A, rather than 1 if done before the loop.  

For more on data optimization, see

Knobe, Kathleen, Lukas, Joan D., and Steele, Guy L. Jr.  "Massively
Parrallel Data Optimization."  Proceedings of the 2nd Symposium on the
Frontiers of Massively Parallel Computation (Oct 1988), 551-558

Knobe, Kathleen, Lukas, Joan D., and Steele, Guy L. Jr.  "Data
Optimization: Allocation of Arrays to Reduce Communication on SIMD
Machines."  Technical Report CA-8902-2801.  Compass, Inc., Wakefield
MA.  (to appear in the Feb 1990 issue of the Journal of Parallel and
Distributed Computing)


  Roger Sacilotto,  Compass, Inc.  
  550 Edgewater Dr., Wakefield, MA 01880, USA.
  think!compass!sacilott
  (617) 245-9540