Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!lll-winken!lll-lcc!ames!orville.nas.nasa.gov!fouts
From: fouts@orville.nas.nasa.gov (Marty Fouts)
Newsgroups: comp.arch
Subject: Macho flops versus Megaflops (was Re: ETA10-P performance)
Message-ID: <3322@ames.arpa>
Date: Sat, 7-Nov-87 14:06:15 EST
Article-I.D.: ames.3322
Posted: Sat Nov 7 14:06:15 1987
Date-Received: Mon, 9-Nov-87 07:03:52 EST
References: <676@zycad.UUCP>
Sender: usenet@ames.arpa
Reply-To: fouts@orville.nas.nasa.gov.UUCP (Marty Fouts)
Lines: 67

Kevin Buchs asks why the ETA-10 is advertised at 375 MFLOP but does 10 MFLOP on Linpack, and if other machines such as the Cray 2 have the same problem.

The answer to the second question is yes; most machines have a different peak number and "average" number. I believe that an EPA sticker needs to be put on (super)computer claims:

    'Vendor factory calculations show a maximum performance of X Units. Use this number as a guide, your floppage may vary, according to programming style and problem conditions.'

The advertised peak performance number is just that: peak performance. It is frequently referred to around here as the "guaranteed not to exceed this speed" number, and is usually obtained (for a supercomputer) by application of the following logic:

The machine, when running in full-blown vector mode, can pump out 1 FLOP result per functional unit per N clock periods. (We try to make N 1 also ;-) It has M functional units which can be active simultaneously and a clock period of T nanoseconds. Therefore, if you have an application which can be coded to use every functional unit, and is entirely vector in performance, you can achieve M / (N * T) floating-point results per nanosecond, that is, 1000 * M / (N * T) MFLOPS. This is the rate the vendor quotes.

On a real application, this rate can be slowed by many things. First of all, your application isn't entirely vector adds and multiplies; it has to do other work.
This leads to the vector/scalar trade-off which Gene Amdahl loves so much: hot vector computers aren't nearly as hot when run as scalar machines. (A 10 to 1 time ratio is not uncommon.) If you have a code whose work is 10% scalar and 90% vector, on a machine which has the 10 to 1 performance ratio, that code is going to spend about as much time doing scalar work as it does doing vector work.

Secondly, even if your application is all vector, there is probably some architectural gotcha that will keep the machine from getting peak performance. For example, you really need 3 adds and 1 multiply, but the machine has 2 adders and 2 multipliers, so the multipliers are idle part of the time while the adders handle the extra work load. There are many of these kinds of gotchas, relating to vector length, number and type of functional units, and memory reference patterns. (On the Cray 2, it can take 1.5 times as long to reference the same memory bank twice in a row as it does to reference two successive memory banks.)

Thirdly, there is the quality of the compiler technology. The better a compiler is at detecting optimizable code, the better performance it can achieve. Originally, the Cray 2 C compiler would produce about 7 million whetstones and the Fortran compiler about 15 million. Now, the C compiler produces 11 million and the Fortran compiler 20 million. (Dusty deck double precision in all four cases.)

Finally, I/O can do you in. You might have a machine with a small physical memory and backing stores, such as SSD on an X/MP or virtual memory on a 205, and you have to keep moving your data between very fast main memory and not-so-fast backing store so that your CPU can get to it. (Small is relative; an X/MP 4-16 has 16 MWord = 128 MByte of main memory, which is big compared to a PC, but small compared to the 2048 MByte of memory on a Cray 2. The key feature is that all of the data being crunched doesn't fit.)
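
The peak-rate logic and the 10%-scalar example above are easy to check with a few lines of arithmetic. The following Python is only a sketch: the function names are made up for illustration, the machine parameters in the usage section (2 functional units, 4 ns clock) are hypothetical round numbers and not the specification of any real machine, and the scalar/vector percentages are read as fractions of the total work.

```python
def peak_mflops(units, clocks_per_result, clock_ns):
    """Vendor-style peak rate.

    M / (N * T) with T in nanoseconds is results per nanosecond;
    multiplying by 1000 converts that to MFLOPS.
    """
    return 1000.0 * units / (clocks_per_result * clock_ns)


def time_split(scalar_fraction, vector_to_scalar_ratio):
    """Time spent in scalar and vector work, in units where running
    the whole job at scalar speed would take 1.0.

    scalar_fraction is the share of the work that cannot be vectorized;
    vector_to_scalar_ratio is how much faster vector mode is.
    """
    scalar_time = scalar_fraction
    vector_time = (1.0 - scalar_fraction) / vector_to_scalar_ratio
    return scalar_time, vector_time


def overall_speedup(scalar_fraction, vector_to_scalar_ratio):
    """Speedup over an all-scalar run (Amdahl's argument)."""
    scalar_time, vector_time = time_split(scalar_fraction,
                                          vector_to_scalar_ratio)
    return 1.0 / (scalar_time + vector_time)


if __name__ == "__main__":
    # Hypothetical machine: M = 2 units, N = 1 clock per result, T = 4 ns.
    print("peak MFLOPS:", peak_mflops(2, 1, 4.0))

    # The example from the text: 10% scalar work, 10 to 1 ratio.
    s, v = time_split(0.10, 10.0)
    print("scalar time:", s, "vector time:", v)
    print("overall speedup:", overall_speedup(0.10, 10.0))
```

Under these assumptions the scalar tenth of the work takes slightly longer than the vector nine tenths, and the whole job runs only a little over 5 times faster than it would in pure scalar mode, nowhere near the 10 to 1 ratio.
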
And all of these things occur at a gross physical level, so that the programmer has to be painfully aware of them. I have written simple loops on the Cray 2 in C or Fortran which get 15-20 MFlops, and which can be replaced by hand-coded assembly that gets 150-1200 MFlops.

The bottom line is that the vendor reports the peak speed, while Linpack, the Livermore Loops, the NAS kernels, Whetstone, et al. report how well the vendor's compiler technology translates a particular algorithm into code to run on the vendor's architecture.

My favorite pathological case is a C program I wrote which runs twice as fast on a Vax as on the Cray 2, simply because I coded for pathological behavior on the 2.
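
To put a number on the functional-unit gotcha mentioned earlier (3 adds and 1 multiply per result on a machine with 2 adders and 2 multipliers), here is a rough Python sketch. It assumes an idealized steady state in which every unit can retire one operation per clock and adds and multiplies overlap perfectly; the function names are hypothetical, and real machines add chaining, vector-length, and memory effects on top of this.

```python
def clocks_per_iteration(adds, multiplies, adders, multipliers):
    """Clocks needed per loop iteration when the busier class of
    functional unit is the bottleneck (idealized, fully overlapped)."""
    return max(adds / adders, multiplies / multipliers)


def fraction_of_peak(adds, multiplies, adders, multipliers):
    """Achieved operations divided by what all units could have
    delivered in the same number of clocks."""
    clocks = clocks_per_iteration(adds, multiplies, adders, multipliers)
    return (adds + multiplies) / (clocks * (adders + multipliers))


def multiplier_utilization(adds, multiplies, adders, multipliers):
    """Share of the time the multipliers are actually busy."""
    clocks = clocks_per_iteration(adds, multiplies, adders, multipliers)
    return (multiplies / multipliers) / clocks


if __name__ == "__main__":
    # The mix from the text: 3 adds, 1 multiply, 2 adders, 2 multipliers.
    print("fraction of peak:", fraction_of_peak(3, 1, 2, 2))
    print("multiplier busy:", multiplier_utilization(3, 1, 2, 2))
```

In this idealized model the adders are the bottleneck, the multipliers sit idle about two thirds of the time, and an all-vector code with this operation mix tops out at roughly two thirds of the quoted peak before any other gotcha is counted.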