Xref: utzoo comp.arch:23385 comp.benchmarks:700 comp.lang.fortran:5741 comp.lang.c:40290
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!validgh!dgh
From: dgh@validgh.com (David G. Hough on validgh)
Newsgroups: comp.arch,comp.benchmarks,comp.lang.fortran,comp.lang.c
Subject: Suggestions for SPEC 3.0 CPU Performance Evaluation Suite
Message-ID: <403@validgh.com>
Date: 21 Jun 91 02:47:09 GMT
Followup-To: poster
Organization: validgh, PO Box 20370, San Jose, CA 95160
Lines: 438

When the idea that became SPEC first started circulating, I was among the many
who agreed that it would be good if somebody, somewhere did all the work
necessary to establish an industry standard performance test suite to
supersede *h*stone and the many linpacks, which had outlived their usefulness
in an era of rapid technological change.  Fortunately a few somebodies
somewhere did get together and do the work, and the SPEC 1.0 benchmark suite
has been a tremendous success in re-orienting end users toward realistic
expectations about computer system performance on realistic applications, and
in re-orienting hardware and software designers toward optimizing performance
for realistic applications.

In that spirit I'd like to suggest some changes for consideration in SPEC 3.0,
the second-generation compute-intensive benchmark suite.  Many of the
suggestions come from study of the Perfect Club benchmarks and procedures,
which are more narrowly focused than SPEC, primarily on scientific Fortran
programs.

Why does SPEC need to publish a new 3.0 suite just as 1.0 is getting well
established?  Because the computer business is an extremely dynamic one, and
performance measurement techniques have lifetimes little better than the
products they measure - a year or two!

Reporting Results

In addition to the mandatory standard SPEC results, which allow changes to
source code solely for portability, SPEC should also permit optional
publication of tuned SPEC results in which applications may be rewritten for
better performance on specific systems.  In the spirit of SPEC, publication of
tuned results must be accompanied by listings of the differences between the
tuned source code and the portable source code.  If these differences are so
massive as to discourage publication, perhaps that's a signal to the system
vendors that they've been unrealistic in tuning.

SPEC previously allowed publication of results for source codes enhanced for
performance.  This was a mistake because it was not accompanied by all the
specific source code changes!  All confirmed SPEC results must be reproducible
by unassisted independent observers from published source codes and Makefiles
and commercially available hardware and software.

These two types of results - on portable programs and on specifically tuned
programs - correspond to two important classes of end users.  Most numerous
are those who, for many reasons, can't or won't rewrite programs.  Their needs
are best represented by SPEC results on standard portable source code.  More
influential in the long run, but far fewer in number, are leading-edge users
who will take any steps necessary to get the performance they require,
including rewriting software for specific platforms.  Supercomputer users are
often in this class, as are former supercomputer users who have migrated to
high-performance workstations.

Arguing the legitimacy of rewrites by system vendors would be a black hole for
the SPEC organization.
Allowing rewrites under public scrutiny leaves the decision about
appropriateness to the leading-edge end users who would have to make such a
determination anyway.  Requiring tuned SPEC results always to be accompanied
by portable SPEC results and by the corresponding source code diffs reminds
the majority of end users of the cost required to get maximum performance on
specific platforms.

Just as tuned SPECstats should never be confused with portable SPECstats,
projected SPECstats for unreleased hardware or software products should never
be confused with confirmed SPECstats.  A confirmed SPECstat is one that can be
reproduced by anybody because the benchmark sources and Makefiles are
available from SPEC, and the hardware and software are publicly available.

Nor should SPECstats computed from SPEC 3.0 be confused with those computed
from SPEC 1.0.  All SPECstats should be qualified with an identification of
the SPEC suite used to compute them.  The calendar year of publication is
easiest to remember.  Thus integer performance results derived from SPEC 3.0
benchmarks published in 1992 should be identified:

     SPECint.92                   confirmed from portable source
     SPECint.92.projected         projected from portable source
     SPECint.92.tuned             confirmed from tuned source, diffs attached
     SPECint.92.tuned.projected   projected from tuned source, diffs attached

Similarly for SPECfp.  I suspect the overall SPECmark has outlived its
usefulness - there is no reason to expect SPECint and SPECfp to be closely
correlated in general.  Otherwise there would be no need to measure both.  If
any circumstance warrants publishing just one SPECmark, let it be the worst
SPECratio of all the programs.

Defining "floating-point-intensive application" and "integer application" is
an interesting problem.  If floating-point operations constitute less than 1%
of the total dynamic instruction count on all platforms that bother to
measure, that's surely an integer application.  If floating-point operations
constitute more than 10% of the total dynamic instruction count on all
platforms that bother to measure, that's surely a floating-point application.
Intermediate cases may represent important application areas; these should not
be included in SPEC 3.0, however, unless at least three can be identified.
spice running the greycode input could be the first.  Should these mixed cases
be included in SPECint.92, or SPECfp.92, or form a third category,
SPECmixed.92?

All SPECstats.92 should include an indication of the dispersion of the
underlying set of SPECratios used to compute the SPECstat.92 geometric mean.
It is a feature of modern high-performance computing systems that their
relative performance varies tremendously across different types of
applications.  It is therefore inevitable, rather than a defect in SPEC, that
a single performance figure has so little predictive power.  This means a
single number should never be cited as a SPECstat.92.

SPEC requires that SPECmark results be accompanied by SPECratios for each
test.  This is an important requirement, but it is not realistic to expect
every consumer of SPEC results to absorb 30 or more performance numbers for
every system.  Some additional simple means of representing dispersion is
warranted.  One very simple method is to quote the range from the worst
SPECratio to the geometric mean: SPECint.92 = 15..21 means that the worst
SPECratio was 15 and the geometric mean was 21.
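For concreteness, here is a small C sketch of that summary computation.  The
SPECratios in the array are made up purely for illustration; a real report
would use the measured ratio of reference time to system time for each
benchmark in the suite.

     #include <stdio.h>
     #include <math.h>

     int main(void)
     {
         /* hypothetical SPECratios for the individual benchmarks */
         double ratio[] = { 15.0, 18.0, 21.0, 23.0, 25.0, 27.0 };
         int n = sizeof(ratio) / sizeof(ratio[0]);
         double worst = ratio[0], logsum = 0.0;
         int i;

         for (i = 0; i < n; i++) {
             if (ratio[i] < worst)
                 worst = ratio[i];
             logsum += log(ratio[i]);   /* geometric mean computed via logs */
         }
         /* worst SPECratio .. geometric mean, e.g. "SPECint.92 = 15..21" */
         printf("SPECint.92 = %.0f..%.0f\n", worst, exp(logsum / n));
         return 0;
     }

The geometric mean is computed through logs here because that is also the form
needed for the dispersion measure discussed later.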
The worst ratio is much more likely to be achieved on realistic problems than
the best, which is why I don't see much value in quoting the latter except as
part of the list of all the SPECratios.

A more complicated way to represent dispersion, based on standard deviations
of the logs of the SPECratios in order to produce a SPECstat of the form
21 +- 3, is discussed later.

Summary of Reporting Format

Thus I propose that SPEC 3.0 results be reported in the form

     SPECint.92 = WI..UI
     SPECfp.92  = WF..UF

where W = worst SPECratio and U = geometric mean of SPECratios.  The complete
list of portable SPECratios follows.  Optionally after that,

     SPECint.92.tuned = WI..UI
     SPECfp.92.tuned  = WF..UF

followed by the complete list of tuned SPECratios AND the complete list of
source differences between tuned and portable source.  Add another column for
SPECmixed.92 if desired, and for SPECmark.92 if retained.

Floating-point Precision

SPECint and SPECfp are currently recognized as subsets of SPECmark.  Should
single-precision floating-point results be treated separately from
double-precision?  How should SPECfp be reported on systems whose "single
precision" is 64-bit rather than the 32-bit common on workstations and PCs?
Although 64-bit computation is most common as a safeguard against roundoff,
many important computations are routinely performed in 32-bit single precision
with satisfactory results.

To bypass these issues, SPEC Fortran source programs declare floating-point
variables as either "real*4" or "real*8" - no "real" or "double precision".
To be meaningful to end users, SPEC source codes would ideally allow easily
changing the precision of variables, and vendors would be allowed and
encouraged to treat working precision like a compiler option, using the
fastest precision that yields correct results - and of course documenting
those choices.  I know from experience, however, the great tedium of adapting
source codes to be so flexible; such flexibility also requires greater care in
testing the correctness of results.

Verifying Correctness

Astounding performance while computing erroneous results is easy to achieve
but not very interesting.  Correctness verification is ideally an independent
step that is not timed as part of the benchmark.  Somewhat in contradiction,
it is highly desirable that correctness be expressed in terms meaningful to
the application.  For physical simulations, appropriate tests of correctness
include checks that physically conserved quantities such as momentum and
energy are conserved computationally.

Consider the linpack 1000x1000 benchmark as an example - because it's easy to
analyze rather than because it's appropriate for SPEC 3.0.  The rules require
you to use the data generation and result-testing software provided by
Dongarra, but you may code the computation in any reasonable way appropriate
to the system.  Correctness is determined by computing and printing a single
number, a normalized residual ||b-Ax||, that depends on all the quantities x
computed in the program - thus foiling optimizers that aggressively eliminate
dead code.

How does one determine whether a residual is acceptable?  Unfortunately that
question can only be answered by the designers or users of the application.
In this respect the linpack benchmark is obviously artificial because there
really is no a priori reason to draw the line of acceptable residuals at 10,
or 100, or 1000... it depends on the intended use of the results.
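To make the flavor of such a check concrete, here is a C sketch of a residual
test along the general lines of the linpack one.  It is not the official test
code; the exact normalization shown is an assumption for illustration, and the
matrix a, right-hand side b, solution x, and order n would come from the
benchmark's own data generator and solver.

     #include <math.h>
     #include <float.h>

     /* Normalized residual ||b - Ax|| / (n * ||A|| * ||x|| * eps),
        using infinity norms; a is n x n, stored row-major. */
     double normalized_residual(int n, const double *a, const double *b,
                                const double *x)
     {
         double resid = 0.0, anorm = 0.0, xnorm = 0.0;
         int i, j;

         for (i = 0; i < n; i++) {
             double r = b[i], rowsum = 0.0;
             for (j = 0; j < n; j++) {
                 r -= a[i*n + j] * x[j];      /* row i of b - Ax */
                 rowsum += fabs(a[i*n + j]);  /* row sum for ||A|| */
             }
             if (fabs(r) > resid)    resid = fabs(r);
             if (rowsum > anorm)     anorm = rowsum;
             if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
         }
         return resid / (n * anorm * xnorm * DBL_EPSILON);
     }

A result of a few units suggests a correct solution; where exactly to draw the
line is, as noted above, up to the designers or users of the application.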
If the correctness criterion were established in absolute terms (rather than
relative to the underlying machine precision, as linpack's normalized residual
is), then there would be no harm in rewriting programs in higher precision and
avoiding pivoting, if that produced acceptable results and improved
performance.

The difference between absolute correctness criteria and criteria relative to
machine precision reflects differences between the requirements placed on
complete applications and the requirements of mathematical software libraries.
The complete application typically needs to compute certain quantities to some
known absolute accuracy.  Mathematical software libraries, like the Linpack
library from which the well-known benchmark was drawn, will be used by many
applications with many differing requirements not known in advance, so their
quality should be the highest reasonably obtainable with a particular
arithmetic precision, and thus is best measured in units of that precision.

General Content

SPEC 3.0 benchmarks should time important realistic applications, as complete
as portability permits, from whose performance users may reasonably project
the performance of their own similar applications.

Benchmarks should be independent in this statistical sense: it should not be
possible to predict the performance of one SPEC benchmark with any accuracy
across many SPEC member platforms based upon the known performance of some
disjoint subset of the other SPEC benchmarks on those platforms.  As long as
important realistic applications are chosen that can be reasonably verified
for correctness, and this independence criterion is satisfied, I see no need
to arbitrarily limit the number of SPEC computational benchmarks.

In addition to the current SPEC 3.0 candidates, I recommend to SPEC for future
consideration the Fortran programs collected by the PERFECT Club.  Aside from
spice2g6, which has a different input deck, they seem mostly independent of
the current SPEC 1.0 programs.

A number of the gcc and espresso subtests run too fast to be timed accurately.
They should be replaced by more substantial ones.

Specific Comments - matrix300

The matrix300 benchmark has outlived its usefulness.  Like linpack before it,
it has forced the adoption of new technology that has in turn made it
obsolete, for it is now susceptible to optimization improvements seldom
observed in realistic applications.  Amazing performance improvements have
been reported by applying modern compiler technology previously reserved for
vectorizing supercomputers:

     System      Old SPECratio    New SPECratio
     IBM 550          100              730
     HP 730            36              510

Competitive compilers should indeed exploit such technology, but it does end
users no good to suggest that many realistic applications will subsequently
show 7X-14X performance improvements.  Such results are simply an artifact of
this particular artificial benchmark, and demonstrate how misleading it is to
present SPEC 1.0 performance with one number.  Inasmuch as the SPEC 1.0
version of matrix300 does not actually report any numerical results, the
entire execution could legitimately be eliminated as dead code, although so
far nobody has exhibited that much temerity.

While the matrix multiplication portion of some realistic applications can and
should demonstrate significant improvements, the overall application's
improvement will be tempered by the portions that aren't susceptible to such
optimizations.
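The arithmetic behind that tempering is worth spelling out.  The numbers in
this small C sketch are assumptions chosen only for illustration, not
measurements of any SPEC benchmark:

     #include <stdio.h>

     int main(void)
     {
         double fraction = 0.5;        /* assumed share of run time in matrix multiply */
         double kernel_speedup = 10.0; /* assumed speedup of that portion alone */

         /* Amdahl's law: the rest of the application is unchanged */
         double overall = 1.0 / ((1.0 - fraction) + fraction / kernel_speedup);
         printf("overall speedup = %.2f\n", overall);  /* about 1.8, not 10 */
         return 0;
     }

Even a tenfold kernel improvement buys less than a factor of two overall when
half the time is spent elsewhere.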
nasa7 includes a matrix multiplication kernel, but the spirit of SPEC is much
better served by incorporating into SPEC 3.0 certain proposed realistic
applications of which matrix multiplication is one important component among
others.

Specific Comments - nasa7

Nasa7 consists of the kernels of seven different important computational
applications.  As such it was much more realistic - because its kernels were
more realistically complicated - than the livermore loops, which it has
largely supplanted.  Each of the specific types of applications - involving
matrix multiplication, 2D complex FFT, linear equations solved by Cholesky,
block tridiagonal, and complex Gaussian elimination methods, etc. - should be
represented separately by realistic applications rather than somewhat
arbitrarily lumped into one benchmark: the repetition factors for the seven
kernels are 100, 100, 200, 20, 2, 10, 400, which may represent the relative
loads at NASA but probably not elsewhere.  And they make the run time fairly
long.

Specific Comments - doduc

There is one troubling aspect of the doduc program: it lacks any good test of
correctness other than the number of iterations required to complete the
program, and that number might not be a very reliable guide.  For instance, if
the simulated time is extended from 50 to 100 seconds, the number of
iterations appears to vary by 20% (20,000 - 24,000) among systems that appear
to behave similarly in shorter runs, casting doubt on the correctness of the
shorter runs.  doduc is an interesting and valuable benchmark that should be
retained in SPEC 3.0 if a more confidence-inspiring correctness criterion can
be devised.

Specific Comments - spice

The greycode input deck doesn't seem to correspond to any very common
realistic computations, and it takes a long time to run as well.  A number of
other input decks have been proposed; SPEC 3.0 should include several of them
as spice2g6 subtests.

In addition I urge SPEC to consider, when opportunity permits, the spice3
program from UCB.  It is unusual - a publicly available, substantial
scientific computation program written in C.  It accepts most of the input
decks that spice2g6 accepts.

Specific Comments - gcc

gcc 1.35 represents relatively old compiler technology suitable for CISC
systems based on the 80386 or 68020, for instance.  gcc 2.0 is designed to do
the kinds of aggressive local optimizations required for RISC architectures -
such as most of the hardware platforms sold on the basis of their SPECmarks.
I encourage SPEC to replace gcc 1.35 with 2.0 as soon as the latter is
available for distribution.

In addition I urge SPEC to consider the f2c Fortran-to-C translator from AT&T.
It is another publicly available, substantial program written in C, with many
of the same kinds of analyses that a full Fortran compiler performs.

SPECstat Computations

SPECint and SPECfp are geometric means of ratios of elapsed real times of
realistic applications.  That's the correct approach.  I would handle the
cases of multiple subtests somewhat differently than SPEC 1.0 does: for gcc
and espresso, and perhaps spice2g6 and spice3 in the future.  Currently the
run times of the subtests are added up to get an overall execution time.  For
the same reason that the geometric mean of several tests is appropriate for
the overall SPECmark, the geometric mean of the SPECratios of the subtests is
the appropriate SPECratio for that test.
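A small C sketch, with made-up per-input times (four inputs here, rather than
espresso's actual eight, purely to keep the example short), shows how the two
ways of combining subtests can differ for a benchmark like espresso:

     #include <stdio.h>
     #include <math.h>

     int main(void)
     {
         /* hypothetical elapsed times per input, reference and measured system */
         double ref[] = { 10.0, 20.0, 40.0, 80.0 };
         double sys[] = {  2.0,  4.0, 10.0, 40.0 };
         int n = sizeof(ref) / sizeof(ref[0]);
         double refsum = 0.0, syssum = 0.0, logsum = 0.0;
         int i;

         for (i = 0; i < n; i++) {
             refsum += ref[i];
             syssum += sys[i];
             logsum += log(ref[i] / sys[i]);   /* per-input SPECratio */
         }
         printf("SPECratio from summed times        = %.2f\n", refsum / syssum);
         printf("geometric mean of per-input ratios = %.2f\n", exp(logsum / n));
         return 0;
     }

The summed-time figure is dominated by whichever input happens to run longest,
while the geometric mean weights every input equally.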
Thus instead of adding up the times for all 8 espresso inputs and comparing
that sum to the sum of the 8 times on the reference system, I'd compute the
SPECratio for each espresso input, compute the geometric mean of those 8
SPECratios, and use that as the SPECratio for the espresso benchmark when
computing the overall SPECmark.

SPECstat.92 Reference Times

The VAX 780 is rapidly disappearing but is anything but rapid in running SPEC
programs.  For convenience, the reference system for SPEC 3.0 should be widely
available and as fast as possible.  One could choose candidate reference
platforms on the basis of SPECmass, the performance equivalent of biomass:
SPECmass = SPECstat * installed base.  On that basis one of the SPARCstations
might be selected, but it doesn't matter too much - any recent, widely
available RISC Unix workstation would do.  The reference result would be the
best elapsed time achieved on the reference system by the time SPEC 3.0 was
announced, using any combination of compiler and operating system producing
correct results.

The results for the reference system would be remarkably balanced - all
SPECratios equal to 1 - which might appear to be to the advantage of the
vendor of the reference system, but any such advantage is illusory.  With
product lifetimes of a year or so, the reference system - which by definition
has a large installed base and therefore is near its end of life - would be
out of production during most of the lifetime of the SPEC 3.0 suite, and any
replacement products from that vendor would likely have SPECratios that would
be far from uniform.

The concept of "system balance" exists mostly in marketing science anyway; the
same computer that has no outstanding bottlenecks in one environment may be
limited by integer, floating-point, memory, I/O, or graphics performance in
others.  If SPEC needs politically to avoid choosing one particular reference
system, it could compromise by choosing several of roughly comparable integer
performance, using one for integer benchmark reference results, one for
floating-point reference results, etc.

To avoid intentional or accidental confusion between SPEC 1.0 SPECstats and
SPEC 3.0 SPECstats.92, it's desirable to recalibrate SPECstats.92.  If the
SPARCstation 2 were chosen as the SPEC 3.0 reference, for instance, then -
ignoring the effects of using a different suite of benchmarks - SPECstat.92
would be immediately deflated by a factor of about 21 relative to the SPEC 1.0
SPECstat, reducing opportunities for confusion.

Why Geometric Mean is Best for SPEC

The progress of some end users is limited by the time it takes a fixed series
of computational tasks to complete.  They then think about the results and
decide what to do next.  The appropriate metric for them is the total elapsed
time for the applications to complete, so the arithmetic mean of times is the
appropriate summary statistic.  If rates, the inverses of times, happen to be
available instead, the appropriate statistic is the harmonic mean of rates.
If application A runs ten times as long as application B, then a 2X
improvement in application A is ten times as important as a 2X improvement in
application B.

Other computational situations are characterized by a continual backlog of
processes awaiting execution.  If the backlog were ever extinguished, the grid
densities would be doubled and saturation would again result.
In these cases the appropriate metric is rates - computations per unit time -
and the appropriate summary statistic is an arithmetic mean of rates or, if
times are available, a harmonic mean of times.  A 2X improvement in
application A is just as important as a 2X improvement in application B.

What about the commonest case, consisting of workloads of both sorts?  With
geometric means of SPECratios, the conclusions are the same whether rates or
times are used, and rate data and time data may readily be combined.  That's
why I like to use the geometric mean to combine SPECratios of diverse types of
programs.

As with benchmarks themselves, the most appropriate way to combine benchmark
results varies among end users according to their situation.  SPEC has wisely
chosen the most neutral way to combine results - while requiring that
individual results be available as well.

Another Way to Represent Dispersion of SPECratios

Accompany every SPECmean computed by geometric mean of SPECratios with a +-
tolerance representing the dispersion in the set of SPECratios used to compute
the geometric mean.  Inasmuch as the geometric mean of SPECratios is the
exponential of the arithmetic mean of the logs of the SPECratios, the
tolerance could be computed from the standard deviation s of the logs of the
SPECratios in this way:

     u = mean(log(SPECratios))
     s = standard deviation(log(SPECratios))
     U = exp(u)
     S = U*(exp(2*s) - 1)
     round U to the nearest two significant figures
     round S upward to the same number of decimal places as U
     SPECmean = U +- S

Thus I would summarize the results of some recent experimental compiler tests
as

     SPECmark = 20 +- 2
     SPECint  = 19 +- 2
     SPECfp   = 21 +- 1

Such a +- presentation emphasizes the futility of buying decisions based on
insignificant SPECmean differences in the third significant figure, and may
therefore help focus system vendor efforts on improving the worst SPECratios
instead of the best.

Strictly speaking statistically, the SPECmean would be

     exp(u)  + exp(u)*(exp(2*s)-1)
             - exp(u)/(exp(2*s)-1)

but simplicity recommends the earlier formulation.

-- 
David Hough

dgh@validgh.com    uunet!validgh!dgh    na.hough@na-net.ornl.gov