Xref: utzoo comp.lang.c:10264 comp.lang.fortran:678
Path: utzoo!utgpu!water!watmath!clyde!bellcore!faline!thumper!ulysses!andante!princeton!udel!gatech!ncar!ames!pasteur!ucbvax!decwrl!sun!dgh!dgh
From: dgh@dgh.UUCP
Newsgroups: comp.lang.c,comp.lang.fortran
Subject: no noalias not negligible - a difference between C and Fortran - long
Message-ID: <54080@sun.uucp>
Date: 21 May 88 01:36:43 GMT
Sender: news@sun.uucp
Lines: 151
Posted: Fri May 20 21:36:43 1988

noalias may be non-negotiable, but it may be non-negligible, as I found 
out somewhat to my surprise this week.
 
At various times I've needed a version of The Linpack Benchmark
written in C, usually because a Fortran compiler was not available;
The Linpack Benchmark is very useful for lots of surprising purposes,
like debugging caches and virtual memory systems.  Of course, The Linpack
Benchmark is by definition written entirely in Fortran, so for benchmarking
purposes a C translation is more or less useless.  This despite an
observation by one of my colleagues, viewing the C version, that
it appeared to still be written in Fortran.  But faithful preservation of
Fortran semantics, including memory access patterns, was one of the goals
of the translation.

My new manager wanted to tackle a small technical job keep in shape,
and so I suggested this translation.  It was not quite as small a job
as we expected, but eventually she got it to produce identical numerical
results as the Fortran version, but she never could get comparable performance.

The results for double precision linpack on Sun-4 using SunOS 4.0 and
Fortran 1.1 were:

		Rolled Source		Unrolled Source

Fortran		1080 KFLOPS		875 KFLOPS
C		 850 KFLOPS		875 KFLOPS

Why was C slower than Fortran in the rolled case?

It turns out that almost all The Linpack Benchmark does is execute
a short subroutine of the following simplified Fortran forms:

      subroutine daxpy(n,da,dx,dy)
      doubleprecision dx(1),dy(1),da
      integer i,n
      do 30 i = 1,n,4
        dy(i)   = dy(i)   + da*dx(i)
        dy(i+1) = dy(i+1) + da*dx(i+1)
        dy(i+2) = dy(i+2) + da*dx(i+2)
        dy(i+3) = dy(i+3) + da*dx(i+3)
 30   continue
      return
      end

OR

      subroutine daxpy(n,da,dx,dy)
      doubleprecision dx(1),dy(1),da
      integer i,n
      do 30 i = 1,n
        dy(i) = dy(i) + da*dx(i)
 30   continue
      return
      end

The first of these is the standard UNROLLED form of the program;
the second is a questionably legal modification called the ROLLED form.  The
original Fortran was written with unrolled source because that generated
more efficient code ten years ago when most compilers didn't unroll
loops for you automatically.  Nowadays many Fortran compilers,
including Sun's, can get better
performance by unrolling the loops themselves than by attempting to
figure out unrolled source code.

Most of the benefit of loop unrolling in high-performance systems derives
from the possibilities it opens up for instruction scheduling across
independent iterations.  The current multiplication can overlap the
previous addition, or current computation can overlap previous stores
and subsequent loads; what is worthwhile varies among implementations.

The corresponding rolled C code could be written with a for loop

daxpy(n, da, dx, dy )
        double          dx[], dy[], da;
        int             n;
{
        int             i;

        for (i = 0; i < n; i++) {
                dy[i] = dy[i] + da * dx[i];
        }
}
 
but Sun compilers catch a red herring: for loops are compiled into two
basic blocks and the unroller won't unroll loops with more than one basic
block.  So to bypass that issue, rewrite a little bit more in the form

daxpy(n, da, dx, dy )
        double          dx[], dy[], da;
        int             n;
{
        int             i;

        i = 0;
        do {
                dy[i] = dy[i] + da * dx[i];
        } while (i++ < n);
}

In this form the optimizer still won't unroll it, 
and consequently the performance is less than Fortran.  If the source
form is unrolled, however, the optimizer can't do as much with Fortran,
and C performance is the same.

Investigation revealed that the reason had to do with noalias:  all Fortran
pointer variables (which happen to be exactly the set of procedure parameters)
are defined by the Fortran standard to be "noalias", meaning a compiler
may optimize code based on the assumption that the pointers never reference
the same memory.  Alleged Fortran programs which break under such optimization
are declared by the Fortran standard to be non-standard.  Very neat.

C, in contrast, has other kinds of pointer variables than procedure
parameters, and many people believe that a global decree of the Fortran type 
would
break a lot of existing C programs.   So the default is that optimization
must assume that any two pointers may be pointing to the same thing unless
it can prove otherwise.  For a while X3J11 had a local "noalias" attribute
that you could attach to pointers, but later recanted in consideration to
assertions like 1) nobody had done it before, which is probably true, 
2) nobody could agree on exactly what it meant, which appeared to be
true, and 3) optimizing compilers should be able to figure out if aliasing
exists, which is definitely false in a separate compilation environment
(unless you want the linker to recompile everything, in which case the
linker is the compiler, and you're back to no separate compilation).
Anyway there is no
portable way in draft ANSI C to say "this pointers are guaranteed
to have no aliases".

Thus the first part of the C compiler does NOT tell the optimizer that any
pointers are guaranteed unaliased; the optimizer won't unroll anything if
there are potential aliasing problems: you don't dare load dx[i+1] before
you store dy[i] if there is any danger that they point to the same place.
The Fortran compiler need have no such qualms.

What is to be done?  I submitted extensive commentary to X3J11 during the
last public review period about numerical issues, but didn't mention noalias
because it was such a hot potato and I didn't think it mattered much, not
having investigated the possibilities.  Even if noalias could be proved
to be unquestionably a good idea, I doubt X3J11 would want to change its draft
again, since such proofs seem so easy to overturn.  Perhaps what will happen
is that high-performance
C compilers will adopt the questionable CDC/Cray Fortran practice 
of providing "unsafe" optimization levels that, for instance,
assume all pointers are unaliased.


David Hough

dhough@sun.com   
na.hough@na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough