Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!gem.mps.ohio-state.edu!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: 55 MIPS & 66 MIPS (really, embedded & military benchmarking)
Summary: analysis of another study
Message-ID: <32528@winchester.mips.COM>
Date: 1 Dec 89 03:50:56 GMT
References: <31329@winchester.mips.COM> <1358@bnr-rsc.UUCP> <5275@omepd.UUCP> <32468@winchester.mips.COM>
Lines: 275

This note:
1) Analyzes the Society of Automotive Engineers (SAE)'s final report,
   "FINAL REPORT, 32 BIT COMMERCIAL ISA TASK GROUP, AS-5, SAE",
   which came out in September or October, I think (?).
2) Discusses an article representing the results of that report.

The objective was:
"the 32 Bit Commercial ISA Task Group was established to evaluate
suitability of existing commercial architectures for use as general
purpose processors in avionic and other embedded applications"

The approach was to request submissions from any vendor who wanted to
propose something, and they got the AMD 29K, Intergraph Clipper, MIPS
R3000, NS32000, Sun SPARC, and Zilog Z80000.

"A set of criteria were established and relative weights set."
This was split into:
	60%: functionality of the instruction sets (general)
	20%: capabilities of the current implementation
	20%: performance

What this means is that there were a bunch of criteria, with points
assigned by discussion of the committee; i.e., there could be 10 points
for some section, and chips might be given anywhere from 2 to 8 points,
then normalized to the maximum found: that is, the one with 8 would get
1 point, and the one with 2 would get 2/8 = .25.  (A short sketch of
this scoring appears after the summary below.)

Totals were:

"Results:
	              29000    R3000    32532    SPARC
	General       42.88    40.12    42.56    43.40
	Current       10.89    13.52    13.65    13.86
	Performance    4.90    14.50    10.92    16.00
	Total:        58.67    68.14    67.14    73.26

Observations: The most significant point of the results is the very
small spread of the point values."

They go on to note that AMD didn't have an Ada compiler available at
the time, and so got zapped on performance.  They also note that they
scaled up the scores for MIPS and SPARC because faster chips became
available than the ones that had been benchmarked.

They noted the difficulty of establishing objective criteria, saying:
"To this end, four meetings and the intervening months were devoted to
establishing the criteria against which the ISAs would be evaluated.
As in any other venture, if we were to start over, we would probably
produce a somewhat different set of criteria, with results that might
be more valuable in their ability to differentiate between the
ISAs....It was also noted that when actual evaluation was started, the
meaning of several of the criteria were obscure and had to be
clarified.

Conclusions: Since these ISAs, and their implementations, are competing
in the market place, it is not surprising that none of the ISAs were
exceptionally better or worse than any of the others...Due to there not
being a typical application, it is not possible to make a definitive
general recommendation.  In general, any of the ISAs will serve well.
Given a specific application, with its own priorities and constraints,
one of the implementations will probably serve that purpose better than
another."

*************************
Thus, the outcome of the study, clearly stated, was:
a) It's hard to create objective criteria.
b) They cannot make any definitive recommendations of one over another.
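Here's a minimal sketch, in Python, of the normalize-to-the-maximum
scoring described above.  This is my reconstruction from the report's
description, using the "Support for cache coherency" ratings quoted in
the next section as example input; whether the committee ran exactly
these numbers through this normalization is my assumption.

    # Scale each chip's raw points so the best chip gets 1.0; a chip
    # with 2 points against a best of 8 gets 2/8 = 0.25, per the text.
    def normalize_to_max(raw):
        best = max(raw.values())
        return {chip: pts / best for chip, pts in raw.items()}

    # Example: the cache-coherency consensus ratings quoted below.
    raw = {"AMD 29000": 2, "MIPS R3000": 5, "National 32000": 2, "Sun SPARC": 8}
    print(normalize_to_max(raw))
    # {'AMD 29000': 0.25, 'MIPS R3000': 0.625, 'National 32000': 0.25, 'Sun SPARC': 1.0}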
*************************
The next section gives the various details of rating points for the
first two categories.  These were done by consensus scoring of
features.  For example:

"Support for cache coherency
	AMD 29000	2
	MIPS R3000	5
	National 32000	2
	Sun SPARC	8"

(There are pages of such things; some of the numbers make sense, and
some are inexplicable to me, but that's OK.  This particular one is
somewhat inexplicable...)  Some of the ratings directly contradict the
findings of people like JMI, whose C Executive runs on many micros, and
who MEASURED things like interrupt-handling and context switching,
rather than consensus-estimating them.

Under "Current implementations", there were good things like:

"How many compatible performance variations are available?
	AMD 29000	1
	MIPS R3000	3
	National 32000	5
	Sun SPARC	5"

(Interesting: it doesn't matter whether an implementation covers a wide
range of performance; what counts is the number of different ones.
Note that the .4 difference (5/5 - 3/5) accounts for more than the full
difference in the final ratings for this section.....)

Finally, we come to the benchmark section, which contains additional
ratings of the type above, plus one section for actual benchmarks.
Sun SPARC is given 50 points (24.5 mips), and the R3000 39.1
(19.15 mips).

I deleted the NS32532 column for space reasons, and added the data
column at the right, which used the Ada compiler with -O; those results
were available in May 1989 and were posted shortly thereafter (I think)
on the JIAWG bboard by the TI folks.  The benchmarks total 2200 lines
of code of Ada, and are a mixture of integer and floating point, as
follows:

	bin_clst	binning & clustering, 135 LOC, integer
	boomult		multiplies boolean matrices together, 102 LOC
	des1		encryption, 346 LOC
	dig_fil		64-bit FFT, 647 LOC
	eightqueens	integer, 98 LOC
	finite2		char->float conversions, 165 LOC
	flmult		float matrix multiplication, 106 LOC
	inmult		integer matrix mult, 81 LOC
	kalman		flt/integer, matrices, 324 LOC
	shell		shell sort, 52 LOC, integer
	substrsrch	substring text search, 103 LOC

Now, here is the data presented in the report, plus my addition of the
last column ("--" marks a missing entry):

                 VAX 11/780  VAX 11/785  R3000      SPARC     R3000 -O
                 DEC         DEC         MIPS Inc.  SUN       MIPS
                                         25 MHz     25 MHz    25 MHz

Times in milliseconds, followed by results in MIPS, normalized to
VAX 11/780 = 1 (Note 3):

bin_clst              0.51        0.48      0.05      0.08       0.04
boomult             981         658       246        49.99     29
des1                160         111        --        13.33     --
dig_fil          111000        2830        70       106.66     55
eightqueens          30          21         1.58      1.65      1.29
finite2              12           9         0.70      0.71      0.60
flmult              765         429        81        65        24
inmult              789         495       104        --        53
kalman              480         330        57        51.66     27
shell                 5           3.1       0.48      0.47      0.31
substrsrch           12           9         0.65      0.55      0.35

bin_clst              1.00        1.06     10.20      6.38     20.00
boomult               1.00        1.49      3.99     19.62     33.80
des1                  1.00        1.44      --       12.00      --
dig_fil (note 3)      0.03        1.00     40.43     26.53     51.5
eightqueens           1.00        1.43     18.99     18.18     23.25
finite2               1.00        1.33     17.14     16.90     20.00
flmult                1.00        1.78      9.44     11.77     31.87
inmult                1.00        1.59      7.59      --       14.89
kalman                1.00        1.45      8.42      9.29     17.78
shell                 1.00        1.61     10.42     10.64     16.13
substrsrch            1.00        1.33     18.46     21.82     34.28

Average               0.91        1.41     14.51     15.31     26.35

Average for 33MHz R3000
and 40MHz SPARC                            19.15     25.16

Note 3) dig_fil results are normalize (sic) to VAX 11/785 results.

Data sources:
	VAX results provided by JIAWG/WPAFB
	R3000 results provided by TI
	SPARC results provided by Sun
-------------------------------------------------------
-------------------------------------------------------
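As a sanity check on the table, here's a minimal sketch (mine, not the
report's code) of how the normalized columns are computed: each
machine's time is divided into the corresponding VAX 11/780 time
(except dig_fil, per Note 3), and each column is then arithmetically
averaged.

    # Normalized MIPS = VAX 11/780 time / machine time, using three of
    # the rows above as examples (times in milliseconds):
    times_780   = {"eightqueens": 30.0, "flmult": 765.0, "shell": 5.0}
    times_r3000 = {"eightqueens": 1.58, "flmult": 81.0,  "shell": 0.48}

    ratios = {bm: times_780[bm] / times_r3000[bm] for bm in times_780}
    print(ratios)   # ~18.99, ~9.44, ~10.42, matching the R3000 column
    # The report's "Average" row is a plain arithmetic average:
    print(sum(ratios.values()) / len(ratios))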
Now, here's a good exercise for the reader: what do you believe from
the data above?  What conclusions can you draw, and why?  What problems
might there be?

1. The benchmarks are very short: remember, the times are in
milliseconds; that is, numbers as low as 40 microseconds are listed.
	=> benchmarks should be longer.

2. There are holes in the data.  The des1 entry for MIPS is missing
(there was an obscure bug in the Ada front end at that point).  The
inmult benchmark for Sun was missing, for reasons I don't know.  It is
very difficult to compute averages when some of the data is missing,
because some benchmarks are tougher than others, and if your best or
worst benchmark gets left out, it can affect the results.  (This is why
it's so nice to have the SPEC benchmarks: it was always a pain getting
a complete set of numbers for the MIPS Performance Brief.)
	=> delete the rows that have missing data.

3. The average is an arithmetic average, NOT a geometric mean.  (The
geometric mean is a better measure for analyzing ratios.)  Also, one of
the data points is normalized differently (to a 785).
	=> use the geometric mean for averaging ratios.

4. If you compute the geometric means, having deleted the two rows that
are missing data, you get: MIPS: 12.63, SPARC: 14.36, MIPS (opt): 25.8.
(See the sketch after this list.)

5. Just scaling up clock rates is meaningless; computers don't work
that way, because the memory systems are relevant.  But suppose you
give SPARC a 40MHz clock rate anyway: that gets its geometric mean to
14.36 x 40/25 = 22.98, i.e., still not as fast as the MIPS at 25MHz....

6. Of course, the variance of all this data is pretty high: with 9 data
points used, the 95% confidence intervals for the 3 are:
	MIPS:    [ 7.0, 23.5]
	SPARC:   [10.6, 20.7]
	MIPS -O: [18.9, 36.4]
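Here's a minimal sketch, in Python, of the arithmetic behind points 4
and 6.  The geometric means follow directly from the table; the report
excerpt doesn't say how the confidence intervals were computed, but a
plain t-based interval on the arithmetic mean of the ratios (t = 2.306
for n-1 = 8 degrees of freedom) reproduces them, so that's what the
sketch assumes.

    import math

    # The 9 complete rows from the normalized table (des1, inmult deleted):
    r3000   = [10.20,  3.99, 40.43, 18.99, 17.14,  9.44,  8.42, 10.42, 18.46]
    sparc   = [ 6.38, 19.62, 26.53, 18.18, 16.90, 11.77,  9.29, 10.64, 21.82]
    r3000_O = [20.00, 33.80, 51.50, 23.25, 20.00, 31.87, 17.78, 16.13, 34.28]

    def geomean(xs):
        # nth root of the product: the appropriate average for ratios
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    def ci95(xs):
        # 95% confidence interval on the arithmetic mean, using the
        # two-tailed t value for 8 degrees of freedom (my assumption
        # about the method; it reproduces the intervals in point 6)
        n = len(xs)
        m = sum(xs) / n
        s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
        h = 2.306 * s / math.sqrt(n)
        return (m - h, m + h)

    for name, xs in (("MIPS", r3000), ("SPARC", sparc), ("MIPS -O", r3000_O)):
        lo, hi = ci95(xs)
        print("%-8s %6.2f  [%.1f, %.1f]" % (name, geomean(xs), lo, hi))
    # MIPS 12.63 [7.0, 23.5]; SPARC 14.36 [10.6, 20.7]; MIPS -O 25.80 [18.9, 36.4]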
Anyway, this is why the committee carefully said that the overall data
didn't mean very much.  Of course, the committee report came out AFTER
the JIAWG decision was made [i.e., it was irrelevant to that], and this
report explicitly did NOT recommend anything as the architecture for
military projects.

Lessons:
1) It's hard to evaluate things on paper.  I think the committee tried
hard, at a really difficult job, but it's real hard...
2) It's always a good idea to look behind the summaries a bit.
3) It's important to understand the difference between numbers that
mean something and numbers that don't.  The committee did understand
that there was insufficient difference to prove anything.

Now, everyone interprets data a bit differently.  Just for fun, let's
look at how Frank Yien and Scott Thorpe of Sun interpreted this, in
SunTech Journal, Autumn 1989, page ST8, in the article called:
"SPARC Scores In DARPA/SAE Architecture Test"

(THERE'S BEEN PLENTY OF DATA; NOW WE GET SOME "MARKETING" ANALYSIS;
QUIT NOW IF YOU DON'T LIKE THAT STUFF.  I INCLUDE THIS BECAUSE I'VE
ALREADY GOTTEN QUESTIONS FROM PEOPLE ABOUT IT, AND THE ARTICLE HAS
APPARENTLY BEEN GIVEN TO PEOPLE ABROAD TO PROVE THAT SPARC WAS SOMEHOW
A U.S.-RECOMMENDED STANDARD....)

The article leads off with:
"In a recent comparison of leading 32-bit architectures by DARPA (the
Defense Advanced Research Projects Agency), the SPARC architecture was
ranked as the top processor architecture for use in military projects."

Well, it had the highest numbers, but they weren't significant, and the
committee said so.  Of course, it didn't matter much anyway, because
the key decisions were being made somewhere else, and the choices
elsewhere [MIPS & Intel] reflected what the large contractors decided
in doing serious evaluations.

"Finally, SPARC won the benchmark category, without using the most
powerful SPARC implementations available from SPARC manufacturers
today.  The 80-MHz ECL SPARC implementation was not used in these
comparisons;"

Of course it wasn't; the embedded avionics market is not excited by
ECL, and Sun didn't have an ECL system for them to benchmark anyway.
So what does ECL SPARC have to do with it?

"instead, the 40-MHz CMOS SPARC implementation was benchmarked and
still won easily, since the others have only 33-MHz chips."

They didn't benchmark a 40-MHz implementation; they benchmarked a
25-MHz one and then multiplied by 40/25.  Note that no 40-MHz SPARC
SYSTEM has yet been announced, much less delivered.  It didn't win
easily; it won barely at 25 MHz, and if they had reported the
correspondingly-optimized MIPS numbers, a 40-MHz SPARC (not yet
delivered in a system) is seen from the chart above to be SLOWER than a
25-MHz R3000 [slower on the average, and slower on 8 out of the 11
benchmarks, the only exceptions being eightqueens, finite2, and shell,
hardly the larger/more realistic tests].

"note that military benchmarks are very demanding and closely resemble
compute-intensive engineering/simulation environments."

Military benchmarks can be demanding, all right, but some of these are
very small: a few of the benchmarks are realistic, some are tiny, and
none have any real-time component that I could see.  If you believe
there's a correlation between these benchmarks and engineering ones,
that's good, because MIPS is faster.  If you don't believe there's much
correlation, that's fine too....

"SPARC is winning the technology battle: It is the frequency leader in
both CMOS and ECL technologies and ranks first in independent tests.
SPARC hardware and software vendors are well positioned for the
future."

Well, to each their own....  Note that the real war for the 32-bit RISC
embedded defense standard seems to have 2 winners, and SPARC wasn't one
of them....  It's possible that some people missed this, although it
sure made the defense magazines...
-- 
-john mashey	DISCLAIMER:
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086