Path: utzoo!attcan!uunet!husc6!bloom-beacon!mit-eddie!ll-xn!ames!pioneer!eugene
From: eugene@pioneer.arpa (Eugene N. Miya)
Newsgroups: comp.arch
Subject: Re: benchmarks
Message-ID: <8734@ames.arc.nasa.gov>
Date: 13 May 88 20:15:58 GMT
References: <30872@amdahl.uts.amdahl.com> <3460014@hpsrla.HP.COM> <2175@winchester.mips.COM> <31755@amdahl.uts.amdahl.com>
Sender: usenet@ames.arc.nasa.gov
Reply-To: eugene@pioneer.UUCP (Eugene N. Miya)
Organization: NASA Ames Research Center, Moffett Field, Calif.
Lines: 86

In summary from my paper:

What makes good benchmarks: You require reproducibility, comprehensibility, and you would like simplicity. In fact, benchmarks are too simple. Machines are becoming more diverse: multiprocessors of different architectures, smart software, etc. The simple linear model of measurement (take start time, do work, take stop time) is susceptible to these influences. What you want are strict controls of the pre-test condition, the test condition, and the post-test condition. Also important is the during-test condition.

One of the biggest challenges to performance measurement is parallelism: the mythical MIPS problem, confusing work and effort, as Brooks would say. So linearity is something we have to fight. Our problem is oversimplicity.

Test data cases must be carefully selected. Do you execute a significant (not necessarily most) portion of the code? Data can be as important as the program itself.

Don't even think about asking about interactive characterization. It's mostly a joke. (Pardons to those reading who did their PhD thesis on characterizing interactive systems.)

We have to know when hardware/software is being beneficial or detrimental. We must throw off the idea that a program is an experiment [Feigenbaum?]; it is not. A single program measurement lacks the experimental controls necessary for good measurements.

We have synthetic as well as real program benchmarks.
The former are usually derided as unrealistic, but the problem is that our concept of the execution of a program is too simplistic. We talk of memory-CPU benchmarks when most of the time is spent doing I/O. We have few I/O benchmarks.

There's no such thing as a standard application. What we need to do is run programs over applications, then take the resultant data to form a performance prediction. (Notice we are never told expected Linpacks, only measured Linpacks.)

My survey covers the shortest and longest synthetic programs I know. Some syntactic analysis, some dynamic, etc.

The problem is we treat machines and benchmarks like black boxes. We expect near-worthless single figures of merit to select machines (I wonder what types of cars these people buy). I'm interested in starting a new "study"; call it computer cardiology. We must systematize the measurement process. I'm talking to one special software house on a benchmark test program generator and am working on prototypes now, in my spare time.

The ideal measurement tool must have a high degree of portability. It must be reasonably simple, and the analysis portion must be separable from the measurement portion. Unfortunately, most machines make poor measurement environments: IBM 370s, VAXen, Macs, PCs. Quantity does not make something good. My standalone time on our X-MP has been curtailed because our users also need the machine. Cray-2s don't have HPMs. The ideal tools should allow one to vary parameters carefully, one at a time.

Linpack, while a nice-appearing simple single figure of merit, has diminishing parallelism (since it's a direct solution). Any 32-bit result should be viewed with suspicion (it is a 64-bit test). Its value is that Jack Dongarra dares to name names and does not fear getting sued for holding damning numbers, nor does Rick Richardson for that Dhrystone matter.

We want the computer equivalent of the pocket tape measure.
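
The measurement discipline described above (a controlled pre-test condition, the start/work/stop timing, one parameter varied at a time, and analysis kept separate from measurement) can be sketched in a few lines. This is only an illustration and not code from the paper: Python is assumed purely for brevity, and every name here (`measure`, `sweep`, `summarize`, `make_input`, `strided_sum`) is hypothetical. It reports a spread of repeated runs rather than a single figure of merit.

```python
import statistics
import time


def measure(setup, work, params, repeats=5):
    """Measurement only: return raw elapsed seconds for each repeat."""
    samples = []
    for _ in range(repeats):
        state = setup(**params)      # pre-test condition, rebuilt for every run
        start = time.perf_counter()  # take start time
        work(state)                  # do work
        stop = time.perf_counter()   # take stop time
        samples.append(stop - start)
    return samples


def sweep(setup, work, baseline, name, values, repeats=5):
    """Vary one parameter at a time; all others stay at their baseline."""
    results = {}
    for value in values:
        params = dict(baseline, **{name: value})
        results[value] = measure(setup, work, params, repeats)
    return results


def summarize(results):
    """Analysis only: reduce raw samples to (min, median, max) per setting."""
    return {
        value: (min(s), statistics.median(s), max(s))
        for value, s in results.items()
    }


# Hypothetical workload: sum every `stride`-th element of a list of `size` ints.
def make_input(size, stride):
    return list(range(size)), stride


def strided_sum(state):
    data, stride = state
    return sum(data[::stride])


if __name__ == "__main__":
    baseline = {"size": 100_000, "stride": 1}
    raw = sweep(make_input, strided_sum, baseline, "size", [10_000, 100_000])
    for value, (lo, mid, hi) in summarize(raw).items():
        print(f"size={value}: min={lo:.6f}s median={mid:.6f}s max={hi:.6f}s")
```

Because `measure` and `sweep` return only raw samples, the same data could be saved and re-analyzed later without rerunning the machine. The sketch makes no attempt to control the during-test condition (other users, caches, I/O), which is the harder part.
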
Computers really don't differ in fundamental construction all that much currently (well, Multiflow, CM[12], DAP, etc.). These represent new challenges for benchmarking.

No, the paper does not read like this; it's being typed "stream of consciousness" in the "heat of passion."

There is a mailing list devoted to performance measurement (@cs.wisc.edu). But they are mostly queueing theorists, not benchmarkers. Largely quiet; after all, SIGMETRICS'88 is, what, next week?

P.S. Don't ask me for a copy yet. I will announce availability; I've promised far too many people and get side-tracked too often. I have a shorter paper which is undergoing review on a tiny aspect of the bigger picture, but I have to send a copy of the bigger one to John, Chuck, and lots of others. Don't worry.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."