Path: utzoo!utgpu!watserv1!watmath!att!att!linac!pacific.mps.ohio-state.edu!zaphod.mps.ohio-state.edu!usc!apple!agate!eos!eugene
From: eugene@eos.arc.nasa.gov (Eugene Miya)
Newsgroups: comp.benchmarks
Subject: More issues of benchmarking
Summary: sorry for a long posting, but I have raw data inside
Keywords: benchmarking balance,
Message-ID: <7601@eos.arc.nasa.gov>
Date: 21 Nov 90 11:08:03 GMT
Reply-To: eugene@eos.UUCP (Eugene Miya)
Organization: NASA Ames Research Center, Calif.
Lines: 261

I hope it is okay that I defer responding to what Rafael and the DG
fellow mentioned about setting a reference machine aside.  I have
certainly thought about those issues in the past and I will get to them
(honest); actually, part of the reason appears below.  I don't think I
will have time to post again this week (I am actively trying to
benchmark, at XXX, XXXX, XXXXXX, XXXXXX, and X-X 8^) due to the short
week and because I am away from my office.

Instead, let me present some issues having to do with metrics, how to
take a simple one, some of the intrusive forces which act on them, and
show some results.  This is an intentionally naive survey.

There are lots of metrics: FLOPS (the joke was "When does a FLOPS become
a success?" [humor out of the way]) and its variants (MFLOPS, GFLOPS,
TFLOPS, and I just saw EFLOPS at SC'90), but also MIPS (BIPS or GIPS),
and counts of specific arbitrary units of work:

        Whet|Dhry|NFS|IO|LHYNE|KHORNER|iHnsert your own STONES
        RhollingStones......(at NBSLIB)

But also vaguer things like "Logical Inferences Per Second" (LIPS).
If you see a good measure of the latter, I want to talk to you.

I have no qualms about the Megas, Gigas, Teras, Etas, peta-prefixes.
I have no qualms about the "Second."  I, and many others, do have
problems with counting instructions and floating-point operations (while
ignoring other instructions).

Yes, it must be admitted early, "the only real measure is how fast you
get your solution done."
If that is all you see, and all you care about is buying machines, stop
reading here and go to the next post.

If you have continued, I hope it is because you are interested in the
other purposes of benchmarking: understanding how to make machines
faster.  I have come to the conclusion that raw timing and limited
counts are the most useful measures.  Yes, they are too simple, but I
have to try and build a framework from the bottom up to understand
benchmarking.

It is possible to define any number of metrics, and write any number of
programs (I have a paper under review in which one co-worker likes my
line "Any program can be a benchmark, but good benchmarks are hard to
find.").  Sure, "A metric is defined by a function d() which maps
elements on a space X...," goes on to prove d maps to the positive real
numbers, "triangle inequality," and so on.  I want to toss that aside.
I'd like to use a different, more empirical approach.

As a former mathematician, I recall that one common technique of proof
by induction typically begins:
        If X exists, then there must be a smallest representative of X.
X here is a benchmark.  We'd like it to be representative of something
we do (computation).  So I began a search for "the smallest benchmark."
Smallest has lots of advantages: carry it anywhere, solves the
portability problem (simplicity problem, sort of), ignoring the database
of results and changing time, no card images to carry, etc.

The smallest (fewest characters) of any benchmark I know is in APL.
Pardon my APL:
        256 (APL iota)  ...(I have it written precisely somewhere)
Basically, create a vector of elements 1..256 and sum them (Gaussian
sum).  255 additions.  Time this.  It also illustrates the very first
example of optimization I know: when Karl F. Gauss "found" n(n+1)/2.
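For readers without APL handy, here is a minimal C sketch of the two ways to reach that sum. The function names are my own (hypothetical), not from any APL system or benchmark suite. The loop does the 255 additions for n = 256; the closed form replaces them with one add, one multiply and one divide.

```c
#include <assert.h>

/* Sum 1..n the long way: n-1 additions, like summing the iota vector. */
long sum_by_loop(long n)
{
    assert(n >= 1);
    long total = 1;
    for (long i = 2; i <= n; i++)
        total += i;             /* one addition per remaining element */
    return total;
}

/* Gauss's closed form: one add, one multiply, one divide. */
long sum_by_formula(long n)
{
    assert(n >= 1);
    return n * (n + 1) / 2;
}
```

Both calls with n = 256 return 32896. A benchmark that times only the answer cannot tell which of the two routines produced it.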
This optimization in a compiler was reported to me by a person in the
APL community who wishes to remain anonymous, because he did it as a
prank (yes, as a prank) on a user who used the above benchmark: the
compiler did a multiply, a divide and one add instead of 255 adds.

This is the shortest Unix benchmark I know:
        echo 99k2vp8opq | /bin/time dc > /dev/null
Time the arbitrary precision desk calculator in Unix to compute 99
digits of the square root of 2.  ("CPU benchmark")

Now you learned something.  Try it.  Suspend or move to another window
and you, too, can run a benchmark.
{Pause: exercise for reader.  Users without BSD-style job control can
escape to the shell using '!sh' or '!csh'.  Non-Unix users, so sorry.}

I thought about collecting timings (see a problem ahead).  It's found in
Musbus among other places.  We used it for some folk at Bell Labs when
we first got an X-MP/12 (actually Cray-1M).

Examples:

VAX-11/780 (convenient):
csh> echo 99k2vp8opq | /bin/time dc > /dev/null
        5.8 real         5.3 user         0.2 sys
csh> uptime
  1:28am  up 2 days, 18:29,  2 users,  load average: 0.20, 0.06, 0.06

SGI Iris 4D/60 (just me):
echo 99k2vp8opq | /bin/time dc > /dev/null
real        1.9
user        1.2
sys         0.2

Cray Y-MP/8128 (heavily loaded):
echo 99k2vp8opq | /bin/time dc > /dev/null
              seconds          clocks
elapsed       0.62413       104021600
user          0.36663        61104464
sys           0.03782         6302641

A large 32-bit mainframe (in use, maybe 12 users):
echo 99k2vp8opq | /bin/time dc > /dev/null
real        0.5
user        0.3
sys         0.0

You see some of the problems benchmarkers have to go through (like
formatting).  This took about 5 minutes of mousing around.  It would
have been much worse if I had to deal with different operating systems,
different timing techniques, and different languages.

These are all called "CPU-memory" intensive measurements.  They
certainly minimize I/O, but they ignore any potential paging, etc.  You
can knock these programs (too small, too simple, too easily optimized
[Gauss himself]).  They are easily transported black boxes.
They can be run so fast that their actual execution time (or the
execution time of any benchmark) improves too well to be a good
measurement tool (converging on 0.0 seconds).

You may think I am presenting a straw man, but I am trying to show you
that "play" is an important part of benchmarking.  We cannot explain how
the different machines above behaved.  We took timings, but they can
raise more questions.

(*Why is the mainframe so close to the Cray?  Sure, the Cray is a fast
word-oriented, floating-point engine.  How do you get bytes?  Mask out
portions of the word (expensive).  I think it is important that students
of benchmarking have experience on machines with some architectural
diversity, so that they can see some hardware (and software) artifacts
{this is overly simple}.  If students don't, if they stay on VAXen, IBM
PCs or the smaller mainframes, they will form some very biased views.
E.g., I am concerned about benchmarking data flow machines. *)

So suppose you are very fed up with the above "dc" benchmark (I think
from Bell Labs).  You go off and naively write your own.  Let's suppose
you have heard the term MFLOPS or MIPS as a metric: Millions of
floating-point operations per second or millions of instructions per
second.  If you constrain yourself to high-level languages (at least in
these days), again, you should have a smallest and simplest.  You know
what a Million is.  Suppose you wrote Fortran and timed:

C       This is called a Pre-test initialization condition
        X=1.0
        Y=1.0
Then timed just:
        Z=X+Y

/* for C programmers*/
        float x=1.0, y=1.0, z;
and
        z = x + y;      /* 8^) */

Remember, let's be naive a moment: we don't know about registers, caches
(I'm going to run on a Cray; to those unfamiliar with Crays it is an
exercise for the reader to determine cache size), virtual memory,
etc......  If I run a million of these, that's a MFLOPS, right?
[Additive linear models.]
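The naive additive model in that question can be written down directly. The sketch below is mine and is not part of the original experiment: naive_mflops and clock_tick_seconds are hypothetical names, and the standard C clock() call is only an assumed stand-in for whatever timer a given machine offers. The first function extrapolates a rate from the time of one operation; the second spins until the clock value changes, which gives a rough idea of the smallest interval that clock can report.

```c
#include <assert.h>
#include <time.h>

/* Naive additive model: if one operation takes t seconds, then the
   machine does 1/t operations per second, or 1/(t * 1e6) MFLOPS. */
double naive_mflops(double seconds_per_op)
{
    assert(seconds_per_op > 0.0);
    return 1.0 / seconds_per_op / 1.0e6;
}

/* Spin until clock() changes; return the size of that step in seconds.
   Returns -1.0 if the implementation has no processor-time clock. */
double clock_tick_seconds(void)
{
    clock_t t0 = clock();
    clock_t t1;

    if (t0 == (clock_t)-1)
        return -1.0;
    do {
        t1 = clock();
    } while (t1 == t0);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

If the reported step is much larger than the time of one addition, a single timed statement will tend to read as zero or as one whole step rather than as its own cost.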
Using an integer real-time clock I get:

 1
 STOP   (called by $MAIN )
 CP: 0.001s,  Wallclock: 0.002s,  4.3% of 8-CPU Machine
 HWM mem: 98502, HWM stack: 2048, Stack overflows: 0

That's 1 clock period.  This is a 6.0 nanosecond machine.
If I substitute a floating-point seconds clock, I get:

 2.8800000000002E-6
 STOP   (called by $MAIN )
 CP: 0.001s,  Wallclock: 0.003s,  3.2% of 8-CPU Machine
 HWM mem: 98505, HWM stack: 2048, Stack overflows: 0

That's not 6.0 nanoseconds.

Consider the following: if I had timed that on a VAX (VMS or Unix) the
time would have been zero (or very close; it will not always be zero
unless you synchronize your clock calls).  We assume several things are
invariant: variable names, data types, and spaces are insignificant to
performance (this is not always the case).  If we had a 780 we would
have to resort to some tricks to make a significant "tick."

If I have a 1 MFLOPS machine, why can't I run the following in 1 second
(assuming an unloaded machine) [Fortran version]:

        X = 0.0
10      X = X + 1.0
        IF(X.LE.1000000.0) GO TO 10
C       equivalent C is left as another reader exercise

That's a million floating-point additions (operations).  Right?
Oh, so you see the comparison; is that a FLOP?  Many people don't think
of that.  Suppose it is.  We cut the 1000000 to, say, 500000.  We see
that portions of the measurement can interfere (I will make a
distinction later between intrusion and interference).  Remember, we are
trying to keep this small.

Did you know some architectures can execute A*B+C very fast (nearly as
fast as a single + or *)?  Why aren't we assuming time(+) == time(*)?
I have seen one FP measurement based on a machine which I cannot name
for non-disclosure reasons; they gave a sample line of code which
included not only +,* to get fastest speed, but also trig functions
(I have that somewhere).  That was the way to attain maximum speed on
that machine.

There is another interference problem with that loop.  A smart compiler
might eliminate the expression by a compile-time evaluation.
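One possible C rendering of the loop is sketched here, under stated assumptions: the names are hypothetical, clock() is an assumed stand-in for a real timer, and volatile is one common way (not the only one) to keep a compiler from replacing the loop with its final value. The first function only counts what the loop executes, to make the comparison visible; the second is the version one would actually time.

```c
#include <assert.h>
#include <time.h>

/* Tally of the floating-point work the loop performs. */
struct op_counts {
    long adds;
    long compares;
};

/* Same control flow as the Fortran loop: add 1.0, then test X .LE. limit.
   The counters are for illustration only; do not time this version. */
struct op_counts count_loop_ops(double limit)
{
    struct op_counts c = {0, 0};
    double x = 0.0;

    assert(limit >= 0.0);
    do {
        x = x + 1.0;
        c.adds++;
        c.compares++;           /* the loop test is a floating-point compare */
    } while (x <= limit);
    return c;
}

/* CPU seconds for the uninstrumented loop.  volatile forces x to be read
   and written on every pass, so the additions cannot be folded away. */
double time_loop(double limit)
{
    volatile double x = 0.0;
    clock_t start, stop;

    assert(limit >= 0.0);
    start = clock();
    do {
        x = x + 1.0;
    } while (x <= limit);
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

With limit = 1000000.0 the counting version reports 1000001 additions and 1000001 comparisons, so the loop is not quite "a million" of anything. Note also that volatile adds memory traffic of its own, which is one more way the measurement can disturb what it measures.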
Such loops are common when programs repeat some unit of work many times
so that their low-resolution clocks can record some time (you then
divide by the number of repetitions to get an average rate [ignores an
initial page fault]).  So some of this stuff is subtle, and it's why we
need to start thinking about measurement equivalence.

As a preview of advanced (play) issues to come, I offer:
It helps to have cycle-time clocks, but non-intrusive (or minimally
intrusive) hardware monitoring tools will be needed.  The real estate is
very expensive.  Again, the Y-MP and the X-MP are good for this.  Can I
post an hpm(1) man page without violating CRI's copyright?  I suspect
not.  But I think I can post simple output, to give you an idea what it
does, and we can get into details later.

This is an output for a simple FORTRAN program:

 STOP   (called by $MAIN )
 CP: 0.001s,  Wallclock: 0.002s,  4.1% of 8-CPU Machine
 HWM mem: 97697, HWM stack: 2048, Stack overflows: 0
Group 0:  CPU seconds   :       0.00      CP executing     :         193512

Million inst/sec (MIPS) :      44.30      Instructions     :          51437
Avg. clock periods/inst :       3.76
% CP holding issue      :      42.81      CP holding issue :          82845
Inst.buffer fetches/sec :       0.77M     Inst.buf. fetches:            897
Floating adds/sec       :       0.21M     F.P. adds        :            246
Floating multiplies/sec :       0.23M     F.P. multiplies  :            267
Floating reciprocal/sec :       0.05M     F.P. reciprocals :             54
I/O mem. references/sec :       0.00M     I/O references   :              0
CPU mem. references/sec :      14.72M     CPU references   :          17092

Floating ops/CPU second :       0.49M

For a C program of equivalent functionality:

Group 0:  CPU seconds   :       0.00      CP executing     :          35317

Million inst/sec (MIPS) :      46.83      Instructions     :           9923
Avg. clock periods/inst :       3.56
% CP holding issue      :      43.34      CP holding issue :          15308
Inst.buffer fetches/sec :       0.66M     Inst.buf. fetches:            140
Floating adds/sec       :       0.00M     F.P. adds        :              1
Floating multiplies/sec :       0.00M     F.P. multiplies  :              0
Floating reciprocal/sec :       0.00M     F.P. reciprocals :              0
I/O mem. references/sec :       0.00M     I/O references   :              0
CPU mem. references/sec :      17.20M     CPU references   :           3645

Floating ops/CPU second :       0.00M

ls -l on the executables:
-rwxr-xr-x   1 eugene   xxx       717264 Nov 20 23:55 xxx
-rwxr-xr-x   1 eugene   xxx       108128 Nov 20 23:56 xx

You would probably be shocked at what these programs do for so much
storage.  I should compile VAX and SUN equivalents.  (I will tell in
time.)

VAX:
-rwxr-xr-x  1 eugene      25600 Nov 21 02:29 xxx
-rwxr-xr-x  1 eugene       4096 Nov 21 02:31 xx
SUN /* can't get the fortran compiler on my sun, NFS down*/
-rwxr-xr-x  1 eugene      24576 Nov 21 02:39 xx

        If you don't see the concept, it doesn't exist
                --Arthur Pyster paraphrasing George Orwell.

We need more tools like the HPM.  Software alone will not help the
performance measurement problem.  Students MUST have exposure to tools
of this type.  We need to develop frames of reference and calibration
tools to better understand our benchmarks and what they are doing.
Then you too can do a `man hpm`.

Let me investigate more specific, less naive details next week.
Whole bunch of issues to cover.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.
  It's 3 AM and I need to get at least 2 sleep periods.
  Resident cynic, Rock of Ages Home for Retired Hackers