Xref: utzoo comp.lang.c:34535 comp.lang.fortran:4300
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!validgh!dgh
From: dgh@validgh.com (David G. Hough on validgh)
Newsgroups: comp.lang.c,comp.lang.fortran
Subject: Fortran vs. C, and noalias: old postings, new data
Message-ID: <221@validgh.com>
Date: 6 Dec 90 04:22:18 GMT
Followup-To: poster
Organization: validgh, PO Box 20370, San Jose, CA 95160
Lines: 234

The current threads of discussion prove that nothing is ever settled for
long on USENET.  Following are some new data, and then some old postings
which explain the data, even though they apply to the hardware and
software available a couple of years ago.

The current data reflect Sun-4/75 and Sun-4/470 hardware, and
developmental versions of Sun Fortran 1.4 and (conventional, not ANSI)
C 1.1.  (Released versions of these products may vary.)  Fortran
compilation options included -dalign, -libmil, -O4, -cg89, and -Bstatic,
while C added -fsingle to these.  So the C compilation environment was
made as much like Fortran's as possible; these particular compilers
share global and local optimizers, code generators, and math libraries.
Note that -O4 implies that some source-level inlining may occur, while
-libmil implies that assembly-language inline expansion templates (that
don't set errno or call matherr) are used to remove most of C's
disadvantage with respect to Fortran in this regard.

The following table shows RATES (operations/second).  SP is single
precision, DP double precision; roll is a Fortran source-rolled 100x100
Linpack source, and croll is a fairly direct hand translation of that
into C; unroll and cunroll are the corresponding source-unrolled
versions; similarly for whetstone and cwhetstone.  Fortran and C sources
were adapted from versions available publicly on netlib [send a message
"send index from benchmark" to netlib@ornl.gov].  Neither of these
benchmarks is good for comparing systems, but they are adequate for
comparing certain language features.

                                    4/75    4/490
    SP.roll         MFLOPS          6.3     5.2
    SP.croll                        4.7     3.5
    DP.roll                         3.8     4.1
    DP.croll                        3.0     2.8

    SP.unroll                       4.6     3.4
    SP.cunroll                      4.5     3.4
    DP.unroll                       3.0     2.7
    DP.cunroll                      3.0     2.7

    SP.whetstone    MWHETS          29      25
    SP.cwhetstone                   22      20
    DP.whetstone                    23      22
    DP.cwhetstone                   20      19

A proprietary geometric-optics benchmark from one of our customers
showed comparable C and Fortran performance because it's dominated by
sqrt rather than by array processing.  C performance would have been
much worse than Fortran's, but for -libmil and -fsingle.
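To see what the -libmil templates buy, consider what a conforming C
sqrt must do on every call.  The following is a minimal sketch of the
idea only (the name checked_sqrt is made up; this is not actual library
source):

    #include <math.h>
    #include <errno.h>

    /*
     * checked_sqrt is a made-up stand-in for a libm entry point.
     * The test, branch, and errno store are the per-call overhead
     * that the -libmil inline templates omit, leaving just the raw
     * hardware operation.
     */
    double
    checked_sqrt(x)
    double x;
    {
        if (x < 0.0) {
            errno = EDOM;   /* domain error; a System V libm would
                               consult matherr() here as well */
            return 0.0;
        }
        return sqrt(x);     /* the raw operation itself */
    }

A program dominated by sqrt, like the geometric-optics code above, pays
that overhead on every call unless the operation is expanded inline.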
I would recommend obtaining the source codes from netlib and comparing
the results on other systems that have common-component C and Fortran
compilers.  You won't see any difference on Macintoshes or PC clones
that lack hardware floating point or that have simple implementations
like the 6888x or 80x87.  The differences should be apparent on
aggressively optimized 80486 and 68040 systems, however, as well as on
most of the current RISC-based systems.

Here's the explanation.  The context was a recent posting from dmr
demanding removal of noalias from the ANSI-C draft; the following is
slightly edited to preserve the main points:

-------------------------------------------------------------------------------

>From postnews Fri May 20 18:33:48 1988
Subject: no noalias not negligible - a difference between C and Fortran - long
Newsgroups: comp.lang.c,comp.lang.fortran
Status: R

noalias may be non-negotiable, but it may be non-negligible, as I found
out somewhat to my surprise this week.

At various times I've needed a version of The Linpack Benchmark written
in C, usually because a Fortran compiler was not available; The Linpack
Benchmark is very useful for lots of surprising purposes, like debugging
caches and virtual memory systems.  Of course, The Linpack Benchmark is
by definition written entirely in Fortran, so for benchmarking purposes
a C translation is more or less useless.  This despite an observation by
one of my colleagues, viewing the C version, that it appeared to still
be written in Fortran.  But faithful preservation of Fortran semantics,
including memory access patterns, was one of the goals of the
translation.

My new manager wanted to tackle a small technical job to keep in shape,
and so I suggested this translation.  It was not quite as small a job as
we expected; eventually she got it to produce numerical results
identical to the Fortran version's, but she never could get comparable
performance.  The results for double-precision Linpack on a Sun-4 using
SunOS 4.0 and Fortran 1.1 were:

                Rolled Source       Unrolled Source
    Fortran     1080 KFLOPS          875 KFLOPS
    C            850 KFLOPS          875 KFLOPS

Why was C slower than Fortran in the rolled case?  It turns out that
almost all The Linpack Benchmark does is execute a short subroutine of
one of the following simplified Fortran forms:

        subroutine daxpy(n,da,dx,dy)
        doubleprecision dx(1),dy(1),da
        integer i,n
        do 30 i = 1,n,4
           dy(i) = dy(i) + da*dx(i)
           dy(i+1) = dy(i+1) + da*dx(i+1)
           dy(i+2) = dy(i+2) + da*dx(i+2)
           dy(i+3) = dy(i+3) + da*dx(i+3)
 30     continue
        return
        end

OR

        subroutine daxpy(n,da,dx,dy)
        doubleprecision dx(1),dy(1),da
        integer i,n
        do 30 i = 1,n
           dy(i) = dy(i) + da*dx(i)
 30     continue
        return
        end

The first of these is the standard UNROLLED form of the program; the
second is a questionably legal modification called the ROLLED form.  The
original Fortran was written with unrolled source because that generated
more efficient code ten years ago, when most compilers didn't unroll
loops for you automatically.  Nowadays many Fortran compilers, including
Sun's, can get better performance by unrolling the loops themselves than
by attempting to figure out unrolled source code.  Most of the benefit
of loop unrolling in high-performance systems derives from the
possibilities it opens up for instruction scheduling across independent
iterations: the current multiplication can overlap the previous
addition, or current computation can overlap previous stores and
subsequent loads; what is worthwhile varies among implementations.

The corresponding rolled C code could be written with a for loop:

    daxpy(n, da, dx, dy)
    double dx[], dy[], da;
    int n;
    {
        int i;
        for (i = 0; i < n; i++) {
            dy[i] = dy[i] + da * dx[i];
        }
    }

[This is actually moved inline to the calling function (dgefa), where it
gets unrolled.  But much of the benefit of unrolling is lost.]

If the source form is unrolled, however, the optimizer can't do as much
with the Fortran, and C performance is the same: optimizers do much
better work with simple loops than with clever ones.

Investigation revealed that the reason had to do with noalias: all
Fortran pointer variables (which happen to be exactly the set of
procedure parameters) are defined by the Fortran standard to be
"noalias", meaning that a compiler may optimize code based on the
assumption that the pointers never reference the same memory.  Alleged
Fortran programs which break under such optimization are declared by the
Fortran standard to be non-standard.  Very neat.
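To see what that guarantee buys, here is the rolled loop unrolled by
hand in C, with every load hoisted above the stores.  This is a sketch
of the schedule a Fortran compiler is entitled to produce on its own
(the name daxpy2 and the two-way unrolling are mine, not the
benchmark's):

    daxpy2(n, da, dx, dy)
    double dx[], dy[], da;
    int n;
    {
        int i;
        double x0, x1, y0, y1;

        for (i = 0; i + 1 < n; i += 2) {
            x0 = dx[i];                 /* all four loads issue   */
            x1 = dx[i+1];               /* before either store,   */
            y0 = dy[i];                 /* so adds, multiplies,   */
            y1 = dy[i+1];               /* and memory traffic of  */
            dy[i]   = y0 + da * x0;     /* independent iterations */
            dy[i+1] = y1 + da * x1;     /* can overlap            */
        }
        for ( ; i < n; i++)             /* clean up a leftover element */
            dy[i] = dy[i] + da * dx[i];
    }

The hoisting is what lets the scheduler keep the adder and multiplier
busy, but it is correct only if dx and dy never reference the same
storage.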
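And here is why a C compiler may not produce that schedule unaided.
The following self-contained harness (again mine, for illustration
only) makes a call that is perfectly legal C but violates the Fortran
assumption:

    #include <stdio.h>

    daxpy(n, da, dx, dy)        /* the rolled loop from above */
    double dx[], dy[], da;
    int n;
    {
        int i;
        for (i = 0; i < n; i++)
            dy[i] = dy[i] + da * dx[i];
    }

    main()
    {
        double a[5];
        int i;

        for (i = 0; i < 5; i++)
            a[i] = 1.0;

        /*
         * dy overlaps dx shifted by one element, so each store
         * feeds the very next load and a running sum results.
         * Nothing in C forbids this call.
         */
        daxpy(4, 1.0, a, a + 1);

        for (i = 0; i < 5; i++)
            printf("%g ", a[i]);
        printf("\n");           /* prints: 1 2 3 4 5 */
        return 0;
    }

With loads and stores kept in order the program prints 1 2 3 4 5.
Feeding the same call to the hand-scheduled daxpy2 above leaves
1 2 2 3 2 instead, because the hoisted loads pick up stale values.
Fortran may declare such a call non-standard and schedule freely; C
must assume it can happen.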
C, in contrast, has other kinds of pointer variables than procedure
parameters, and many people believe that a global decree of the Fortran
type would break a lot of existing C programs.  So the default is that
optimization must assume that any two pointers may be pointing to the
same thing unless it can prove otherwise.  For a while X3J11 had a local
"noalias" attribute that you could attach to pointers, but it later
recanted, in deference to assertions like these:

1) nobody had done it before, which is probably true;

2) nobody could agree on exactly what it meant, which appeared to be
   true; and

3) optimizing compilers should be able to figure out whether aliasing
   exists, which is definitely false in a separate-compilation
   environment (unless you want the linker to recompile everything, in
   which case the linker is the compiler, and you're back to no separate
   compilation).

Anyway, there is no portable way in draft ANSI C to say "this pointer is
guaranteed to have no aliases".  Thus the first part of the C compiler
does NOT tell the optimizer that any pointers are guaranteed unaliased,
and the optimizer won't unroll anything if there are potential aliasing
problems: you don't dare load dx[i+1] before you store dy[i] if there is
any danger that they point to the same place.  The Fortran compiler need
have no such qualms.

What is to be done?  I submitted extensive commentary to X3J11 during
the last public review period about numerical issues, but didn't mention
noalias because it was such a hot potato and I didn't think it mattered
much, not having investigated the possibilities.  Even if noalias could
be proved to be unquestionably a good idea, I doubt X3J11 would want to
change its draft again, since such proofs seem so easy to overturn.
Perhaps what will happen is that high-performance C compilers will adopt
the questionable CDC/Cray Fortran practice of providing "unsafe"
optimization levels that, for instance, assume all pointers are
unaliased.

-------------------------------------------------------------------------------

>From postnews Thu May 26 16:59:09 1988
Status: R

It's also worth noting that several commentators referred to this as a
problem with vector machines, so that, for instance, PC programmers may
have concluded that it was irrelevant to their careers.  However, the
measurements above were taken on a Sun-4/260, which is not a vector
machine, but which does allow the integer unit, the floating-point
adder, and the floating-point multiplier to operate in parallel in
certain circumstances which arise frequently when the instruction
scheduler does its job well.

It's to be expected that floating-point units of comparable complexity
will be common in PC-class machines soon; the Weitek 1167 is an example
of such a unit intended to be installed in 80386 systems, and it has the
same floating-point ALU and multiplier as the Sun-4/260.  Don't assume
that consequently 80386-based PCs will have floating-point throughput
comparable to the Sun-4's, however; there is a small matter of memory
bandwidth to contend with, which becomes more and more noticeable as the
floating-point hardware gets faster, until at the top end it is the main
issue and the floating-point hardware is a minor distraction.

So the issues of loop unrolling and instruction scheduling (and of
identifying unaliased operands) will soon become important to everybody
interested in getting the most out of even simple scalar-oriented
systems.
-- 

David Hough

dgh@validgh.com     uunet!validgh!dgh     na.hough@na-net.stanford.edu