Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!lll-winken!lll-lcc!ames!orville.nas.nasa.gov!fouts
From: fouts@orville.nas.nasa.gov (Marty Fouts)
Newsgroups: comp.arch
Subject: Macho flops versus Megaflops (was Re: ETA10-P performance)
Message-ID: <3322@ames.arpa>
Date: Sat, 7-Nov-87 14:06:15 EST
Article-I.D.: ames.3322
Posted: Sat Nov  7 14:06:15 1987
Date-Received: Mon, 9-Nov-87 07:03:52 EST
References: <676@zycad.UUCP>
Sender: usenet@ames.arpa
Reply-To: fouts@orville.nas.nasa.gov.UUCP (Marty Fouts)
Lines: 67

Kevin Buchs asks why the ETA-10 is advertised at 375 MFLOP but does 10
MFLOP on Linpack, and if other machines such as the Cray 2 have the
same problem.  The answer to the second question is yes; most machines
have a different peak number and "average" number.  I believe that an
EPA sticker needs to be put on (super)computer claims:  'Vendor
factory calculations show a maximum performance of X Units.  Use this
number as a guide; your floppage may vary according to programming
style and problem conditions.'

The advertised peak performance number is just that: peak performance.
It is frequently referred to around here as the "guaranteed not to
exceed this speed" number, and is usually obtained (for a
supercomputer) by application of the following logic:

The machine, when running in full-blown vector mode, can pump out 1 FLOP
result per functional unit per N clock periods.  (We try to make N 1
also ;-)  It has M functional units which can be active simultaneously
and a clock period of T nanoseconds.  Therefore, if you have an
application which can be coded to use every functional unit, and is
entirely vector in performance, you can achieve M / (N * T) FLOPs per
nanosecond, or 1000 * M / (N * T) MFLOPS.  This is the rate the vendor
quotes.
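
As a rough sketch of that arithmetic (the function name and the sample
numbers are made up for illustration and are not any vendor's figures):

```python
def peak_mflops(m_units, n_clocks, t_ns):
    """Peak rate in MFLOPS for M units, N clocks per result, T ns clock.

    M / (N * T) is results per nanosecond; multiplying by 1000 turns
    that into millions of results per second.
    """
    return m_units / (n_clocks * t_ns) * 1000.0


# Hypothetical machine: 2 functional units, 1 result per clock, 4 ns clock.
print(peak_mflops(2, 1, 4.0))
```

Nothing in this number depends on the application; it only describes
what the hardware could do if every unit were busy on every clock.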

On a real application, this rate can be slowed by many things.  First
of all, your application isn't entirely vector adds and multiplies; it
has to do other work.  This leads to the vector/scalar trade-off which
Gene Amdahl loves so much:  hot vector computers aren't nearly as hot
when running scalar code.  (A 10 to 1 time ratio is not uncommon.)
If you have a code which is 10% scalar and 90% vector on a machine
which has the 10 to 1 performance ratio, that code is going to spend
about as much time doing scalar work as it does doing vector work.
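
The 10%/90% example can be checked with a small sketch.  The function
name is hypothetical, and the model assumes the fractions are fractions
of the work, with scalar work running at speed 1 and vector work at
speed 10:

```python
def split_times(scalar_fraction, vector_speedup):
    """Return (scalar_time, vector_time) for one unit of total work.

    Scalar work runs at speed 1; vector work runs vector_speedup
    times faster.
    """
    scalar_time = scalar_fraction
    vector_time = (1.0 - scalar_fraction) / vector_speedup
    return scalar_time, vector_time


scalar_time, vector_time = split_times(0.10, 10.0)
total = scalar_time + vector_time

# Rate actually achieved, as a fraction of the all-vector peak rate.
fraction_of_peak = (1.0 / total) / 10.0

print(scalar_time, vector_time, fraction_of_peak)
```

Under those assumptions the scalar tenth of the work takes slightly
longer than the vector nine tenths, and the code as a whole runs at a
little over half of the all-vector rate.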

Secondly, even if your application is all vector, there is probably
some architectural gotcha that will keep the machine from getting peak
performance.  For example, you really need 3 adds and 1 multiply, but
the machine has 2 adders and 2 multipliers, so the multipliers are idle
part of the time while the adders handle the extra work load.  There are
many of these kinds of gotchas, relating to vector length, number and
type of functional units, and memory reference patterns.  (On the Cray
2, it can take 1.5 times as long to reference the same memory bank
twice in a row as it does to reference two successive memory banks.)
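
The adder/multiplier mismatch can be put in numbers with a simple
throughput model.  This is only a sketch:  the function name is
invented, and it assumes each unit delivers one result per clock and
that nothing else (vector length, memory banks) gets in the way:

```python
def fraction_of_peak(ops_needed, units):
    """Fraction of peak FLOP rate for a loop with a fixed operation mix.

    ops_needed: operations of each kind required per loop iteration.
    units:      number of functional units of each kind.
    Assumes one result per unit per clock in steady state.
    """
    # The busiest kind of unit sets the clocks needed per iteration.
    clocks = max(ops_needed[kind] / units[kind] for kind in ops_needed)
    achieved = sum(ops_needed.values()) / clocks  # FLOPs per clock
    peak = sum(units.values())                    # FLOPs per clock
    return achieved / peak


# 3 adds and 1 multiply per iteration on 2 adders and 2 multipliers.
print(fraction_of_peak({"add": 3, "mul": 1}, {"add": 2, "mul": 2}))
```

With this mix the adders need 1.5 clocks per iteration while the
multipliers need only 0.5, so the best case is about two thirds of
peak before any of the other gotchas apply.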

Thirdly, there is the quality of the compiler technology.  The better
a compiler is at detecting optimizable code the better performance it
can achieve.  Originally, the Cray 2 C compiler would produce about 7
million whetstones and the Fortran compiler about 15 million.  Now,
the C compiler produces 11 and the Fortran compiler 20.  (Dusty deck
double precision in all four cases.)

Finally, I/O can do you in.  You might have a machine with a small
physical memory and backing stores, such as SSD on an X/MP or virtual
memory on a 205, and you have to keep moving your data between very
fast main memory and not so fast backing store, so that your CPU can
get to it.  (Small is relative;  X/MP 4-16 has 16 MWord = 128 MByte of
main memory, which is big compared to a PC, but small compared to the
2048 MByte of memory on a Cray 2.  The key feature is that all of the
data being crunched doesn't fit.)

And all of these things occur at a gross physical level, so that the
programmer has to be painfully aware of them.  I have written simple
loops on the Cray 2 in C or Fortran which get 15-20 MFlops, and which
get 150-1200 MFlops when replaced by hand-coded assembly.

The bottom line is that the vendor reports the peak speed, while
Linpack, the Livermore Loops, the NAS kernels, Whetstone, et al. report
how the vendor's compiler technology translates a particular algorithm
into code to run on the vendor's architecture.  My favorite
pathological case is a C program I wrote which runs twice as fast on a
Vax as on the Cray 2, simply because I coded for pathological behavior
on the 2.