Xref: utzoo comp.lang.c++:5837 comp.lang.fortran:2727
Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!uwm.edu!lll-winken!muslix!jac
From: jac@muslix.llnl.gov (James Crotinger)
Newsgroups: comp.lang.c++,comp.lang.fortran
Subject: Re: inline and vectorization
Message-ID: <40997@lll-winken.LLNL.GOV>
Date: 8 Dec 89 23:03:30 GMT
References: <sZTKVau00VoLMDkUw5@andrew.cmu.edu> <40827@lll-winken.LLNL.GOV> <MCCALPIN.89Dec7105739@masig3.ocean.fsu.edu>
Sender: usenet@lll-winken.LLNL.GOV
Reply-To: jac@muslix.UUCP (James Crotinger)
Followup-To: comp.lang.c++
Organization: Lawrence Livermore National Laboratory/UC Davis
Lines: 53

In article <MCCALPIN.89Dec7105739@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>This is also the style of programming that is appropriate to
>memory-to-memory vector machines (Cyber 205 and ETA-10), and (more
>importantly) for SIMD parallel machines like the Connection Machine.
>The code above runs at the same speed on the ETA-10 (for example)
>whether B*C is pre-calculated or not, since the extra multiply can be
>completely overlapped with the subtract in the second line.
>
  I guess I find that a bit hard to buy. The X-MP also does chaining,
but the above example runs 33% slower on our X-MP when written using
Cray Fortran's vector notation than it does when the loop is written
out explicitly, with both loops jammed into one. Interestingly,
on the Cray 2, which does no chaining, the vector style version is only
20% slower. I suspect that the memory bandwidth is what's really the 
killer here. In the version which is written out as one jammed loop,
the Cray should do the following:

      loop:

         load 64 elements of A
         load 64 elements of B
         load 64 elements of C
         calculate E = A + B*C   (for 64 elements)
         calculate D = A - B*C   (ditto)
         store E                 (ditto)
         store D                 (ditto)
         goto loop

(with appropriate logic to end the loop). The savings of not having 
to go out to memory to get A, B, and C twice are not small at all. 
Furthermore, on the Cray 2 stuff like storing E can be overlapped
with the calculation of D...
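
  To make the memory-traffic argument concrete, here is a rough sketch in
C++ of the two forms. It is an illustration only: the names (Result,
two_sweeps, jammed) are made up, and the two-sweep version is my guess at
what the vector-notation statements amount to, not the code from the
earlier article. Counting references per element, the two-sweep form does
3 loads + 1 store twice (8 in all), while the jammed form does 3 loads +
2 stores (5 in all) and computes B*C once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the two results; not from the original post.
struct Result {
    std::vector<double> e;  // E = A + B*C
    std::vector<double> d;  // D = A - B*C
};

// Vector-notation style: each statement is its own full sweep over memory.
// Per element: (load A, B, C; store E) + (load A, B, C; store D) = 8 references.
Result two_sweeps(const std::vector<double>& a,
                  const std::vector<double>& b,
                  const std::vector<double>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t n = a.size();
    Result r{std::vector<double>(n), std::vector<double>(n)};
    for (std::size_t i = 0; i < n; ++i) {
        r.e[i] = a[i] + b[i] * c[i];
    }
    for (std::size_t i = 0; i < n; ++i) {
        r.d[i] = a[i] - b[i] * c[i];  // B*C is recomputed, A, B, C reloaded
    }
    return r;
}

// Jammed style: one sweep, operands loaded once, B*C computed once.
// Per element: load A, B, C; store E; store D = 5 references.
Result jammed(const std::vector<double>& a,
              const std::vector<double>& b,
              const std::vector<double>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t n = a.size();
    Result r{std::vector<double>(n), std::vector<double>(n)};
    for (std::size_t i = 0; i < n; ++i) {
        const double bc = b[i] * c[i];  // common subexpression, held in a register
        r.e[i] = a[i] + bc;
        r.d[i] = a[i] - bc;
    }
    return r;
}
```

  Both functions produce the same E and D; only the number of trips to
memory differs. Whether a given compiler turns the first form into the
second on its own is exactly the open question in this thread.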

>
>>My question
>>is, how smart will the compilers get. Will compilers evaluate the common
>>subexpression (B*C) once or twice?
>
>I don't know of *any* vectorizer/optimizer which will do this sort of
>optimization on vector quantities. Anyone from Cray care to comment on
>the current status of the Cray compiler on this code? 
>
>It is *very* important that this capability be developed, since more
>and more machines are going to be memory-bandwidth-deficient in the
>next few years.
>

  Exactly.
>John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
>		   mccalpin@scri1.scri.fsu.edu
>		   mccalpin@delocn.udel.edu

  Jim