Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!gem.mps.ohio-state.edu!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: 55 MIPS & 66 MIPS (really, embedded & military benchmarking)
Summary: analysis of another study
Message-ID: <32528@winchester.mips.COM>
Date: 1 Dec 89 03:50:56 GMT
References: <31329@winchester.mips.COM> <1358@bnr-rsc.UUCP> <5275@omepd.UUCP> <32468@winchester.mips.COM>
Lines: 275

This note:
1) Analyzes the Society of Automotive Engineers (SAE)'s final report,
   "FINAL REPORT, 32 BIT COMMERCIAL ISA TASK GROUP, AS-5, SAE",
   which came out in September or October, I think (?).
2) Discusses an article representing the results of that report.

The objective was:
"the 32 Bit Commercial ISA Task Group was established to evaluate
suitability of existing commercial architectures for use as general
purpose processors in avionic and other embedded applications"

The approach was to request submissions from any vendor who wanted to
propose something, and they got the AMD 29K, Intergraph Clipper, MIPS
R3000, NS32000, Sun SPARC, and Zilog Z80000.

"A set of criteria were established and relative weights set."
This was split into:
	60%: functionality of the instruction sets (general)
	20%: capabilities of the current implementation
	20%: performance

What this means is that there were a bunch of criteria, with points
assigned by discussion of the committee; i.e., there could be 10 points
for some section, and chips might be given anywhere from 2 to 8 points,
then normalized to the maximum found: that is, the one with 8 would get
1 point, and the one with 2 would get 2/8 = .25.  (A short sketch of
this scoring appears after the summary below.)

Totals were:

"Results:
	              29000    R3000    32532    SPARC
	General       42.88    40.12    42.56    43.40
	Current       10.89    13.52    13.65    13.86
	Performance    4.90    14.50    10.92    16.00
	Total:        58.67    68.14    67.14    73.26

Observations: The most significant point of the results is the very
small spread of the point values."

They go on to note that AMD didn't have an Ada compiler available at
the time, and so got zapped on performance.  They also note that they
scaled up the scores for MIPS and SPARC because faster chips became
available than the ones that had been benchmarked.

They noted the difficulty of establishing objective criteria, saying:
"To this end, four meetings and the intervening months were devoted to
establishing the criteria against which the ISAs would be evaluated.
As in any other venture, if we were to start over, we would probably
produce a somewhat different set of criteria, with results that might
be more valuable in their ability to differentiate between the
ISAs....It was also noted that when actual evaluation was started, the
meaning of several of the criteria were obscure and had to be
clarified.

Conclusions: Since these ISAs, and their implementations, are competing
in the market place, it is not surprising that none of the ISAs were
exceptionally better or worse than any of the others...Due to there not
being a typical application, it is not possible to make a definitive
general recommendation.  In general, any of the ISAs will serve well.
Given a specific application, with its own priorities and constraints,
one of the implementations will probably serve that purpose better than
another."

*************************
Thus, the outcome of the study, clearly stated, was:
a) It's hard to create objective criteria.
b) They cannot make any definitive recommendations of one over another.
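Here's a minimal sketch, in Python, of the normalize-to-the-maximum
scoring described above.  This is my reconstruction from the report's
description, using the "Support for cache coherency" ratings quoted in
the next section as example input; whether the committee ran exactly
these numbers through this normalization is my assumption.

    # Scale each chip's raw points so the best chip gets 1.0; a chip
    # with 2 points against a best of 8 gets 2/8 = 0.25, per the text.
    def normalize_to_max(raw):
        best = max(raw.values())
        return {chip: pts / best for chip, pts in raw.items()}

    # Example: the cache-coherency consensus ratings quoted below.
    raw = {"AMD 29000": 2, "MIPS R3000": 5, "National 32000": 2, "Sun SPARC": 8}
    print(normalize_to_max(raw))
    # {'AMD 29000': 0.25, 'MIPS R3000': 0.625, 'National 32000': 0.25, 'Sun SPARC': 1.0}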
*************************
The next section gives the various details of rating points for the
first two categories.  These were done by consensus scoring of
features.  For example:

"Support for cache coherency
	AMD 29000	2
	MIPS R3000	5
	National 32000	2
	Sun SPARC	8"

(There are pages of such things; some of the numbers make sense, and
some are inexplicable to me, but that's OK.  This particular one is
somewhat inexplicable...)  Some of the ratings directly contradict the
findings of people like JMI, whose C Executive runs on many micros, and
who MEASURED things like interrupt-handling and context switching,
rather than consensus-estimating them.

Under "Current implementations", there were good things like:

"How many compatible performance variations are available?
	AMD 29000	1
	MIPS R3000	3
	National 32000	5
	Sun SPARC	5"

(Interesting: it doesn't matter whether an implementation covers a wide
range of performance; what counts is the number of different ones.
Note that the .4 difference (5/5 - 3/5) accounts for more than the full
difference in the final ratings for this section.....)

Finally, we come to the benchmark section, which contains additional
ratings of the type above, plus one section for actual benchmarks.
Sun SPARC is given 50 points (24.5 mips), and the R3000 39.1
(19.15 mips).

I deleted the NS32532 column for space reasons, and added the data
column at the right, which used the Ada compiler with -O; those results
were available in May 1989 and were posted shortly thereafter (I think)
on the JIAWG bboard by the TI folks.  The benchmarks total 2200 lines
of code of Ada, and are a mixture of integer and floating point, as
follows:

	bin_clst	binning & clustering, 135 LOC, integer
	boomult		multiplies boolean matrices together, 102 LOC
	des1		encryption, 346 LOC
	dig_fil		64-bit FFT, 647 LOC
	eightqueens	integer, 98 LOC
	finite2		char->float conversions, 165 LOC
	flmult		float matrix multiplication, 106 LOC
	inmult		integer matrix mult, 81 LOC
	kalman		flt/integer, matrices, 324 LOC
	shell		shell sort, 52 LOC, integer
	substrsrch	substring text search, 103 LOC

Now, here is the data presented in the report, plus my addition of the
last column ("--" marks a missing entry):

                 VAX 11/780  VAX 11/785  R3000      SPARC     R3000 -O
                 DEC         DEC         MIPS Inc.  SUN       MIPS
                                         25 MHz     25 MHz    25 MHz

Times in milliseconds, followed by results in MIPS, normalized to
VAX 11/780 = 1 (Note 3):

bin_clst              0.51        0.48      0.05      0.08       0.04
boomult             981         658       246        49.99     29
des1                160         111        --        13.33     --
dig_fil          111000        2830        70       106.66     55
eightqueens          30          21         1.58      1.65      1.29
finite2              12           9         0.70      0.71      0.60
flmult              765         429        81        65        24
inmult              789         495       104        --        53
kalman              480         330        57        51.66     27
shell                 5           3.1       0.48      0.47      0.31
substrsrch           12           9         0.65      0.55      0.35

bin_clst              1.00        1.06     10.20      6.38     20.00
boomult               1.00        1.49      3.99     19.62     33.80
des1                  1.00        1.44      --       12.00      --
dig_fil (note 3)      0.03        1.00     40.43     26.53     51.5
eightqueens           1.00        1.43     18.99     18.18     23.25
finite2               1.00        1.33     17.14     16.90     20.00
flmult                1.00        1.78      9.44     11.77     31.87
inmult                1.00        1.59      7.59      --       14.89
kalman                1.00        1.45      8.42      9.29     17.78
shell                 1.00        1.61     10.42     10.64     16.13
substrsrch            1.00        1.33     18.46     21.82     34.28

Average               0.91        1.41     14.51     15.31     26.35

Average for 33MHz R3000
and 40MHz SPARC                            19.15     25.16

Note 3) dig_fil results are normalize (sic) to VAX 11/785 results.

Data sources:
	VAX results provided by JIAWG/WPAFB
	R3000 results provided by TI
	SPARC results provided by Sun
-------------------------------------------------------
-------------------------------------------------------
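As a sanity check on the table, here's a minimal sketch (mine, not the
report's code) of how the normalized columns are computed: each
machine's time is divided into the corresponding VAX 11/780 time
(except dig_fil, per Note 3), and each column is then arithmetically
averaged.

    # Normalized MIPS = VAX 11/780 time / machine time, using three of
    # the rows above as examples (times in milliseconds):
    times_780   = {"eightqueens": 30.0, "flmult": 765.0, "shell": 5.0}
    times_r3000 = {"eightqueens": 1.58, "flmult": 81.0,  "shell": 0.48}

    ratios = {bm: times_780[bm] / times_r3000[bm] for bm in times_780}
    print(ratios)   # ~18.99, ~9.44, ~10.42, matching the R3000 column
    # The report's "Average" row is a plain arithmetic average:
    print(sum(ratios.values()) / len(ratios))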
Now, here's a good exercise for the reader: what do you believe from
the data above?  What conclusions can you draw, and why?  What problems
might there be?

1. The benchmarks are very short: remember, the times are in
milliseconds; that is, numbers as low as 40 microseconds are listed.
	=> benchmarks should be longer.

2. There are holes in the data.  The des1 entry for MIPS is missing
(there was an obscure bug in the Ada front end at that point).  The
inmult benchmark for Sun was missing, for reasons I don't know.  It is
very difficult to compute averages when some of the data is missing,
because some benchmarks are tougher than others, and if your best or
worst benchmark gets left out, it can affect the results.  (This is why
it's so nice to have the SPEC benchmarks: it was always a pain getting
a complete set of numbers for the MIPS Performance Brief.)
	=> delete the rows that have missing data.

3. The average is an arithmetic average, NOT a geometric mean.  (The
geometric mean is a better measure for analyzing ratios.)  Also, one of
the data points is normalized differently (to a 785).
	=> use the geometric mean for averaging ratios.

4. If you compute the geometric means, having deleted the two rows that
are missing data, you get: MIPS: 12.63, SPARC: 14.36, MIPS (opt): 25.8.
(See the sketch after this list.)

5. Just scaling up clock rates is meaningless; computers don't work
that way, because the memory systems are relevant.  But suppose you
give SPARC a 40MHz clock rate anyway: that gets its geometric mean to
14.36 x 40/25 = 22.98, i.e., still not as fast as the MIPS at 25MHz....

6. Of course, the variance of all this data is pretty high: with 9 data
points used, the 95% confidence intervals for the 3 are:
	MIPS:    [ 7.0, 23.5]
	SPARC:   [10.6, 20.7]
	MIPS -O: [18.9, 36.4]
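Here's a minimal sketch, in Python, of the arithmetic behind points 4
and 6.  The geometric means follow directly from the table; the report
excerpt doesn't say how the confidence intervals were computed, but a
plain t-based interval on the arithmetic mean of the ratios (t = 2.306
for n-1 = 8 degrees of freedom) reproduces them, so that's what the
sketch assumes.

    import math

    # The 9 complete rows from the normalized table (des1, inmult deleted):
    r3000   = [10.20,  3.99, 40.43, 18.99, 17.14,  9.44,  8.42, 10.42, 18.46]
    sparc   = [ 6.38, 19.62, 26.53, 18.18, 16.90, 11.77,  9.29, 10.64, 21.82]
    r3000_O = [20.00, 33.80, 51.50, 23.25, 20.00, 31.87, 17.78, 16.13, 34.28]

    def geomean(xs):
        # nth root of the product: the appropriate average for ratios
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    def ci95(xs):
        # 95% confidence interval on the arithmetic mean, using the
        # two-tailed t value for 8 degrees of freedom (my assumption
        # about the method; it reproduces the intervals in point 6)
        n = len(xs)
        m = sum(xs) / n
        s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
        h = 2.306 * s / math.sqrt(n)
        return (m - h, m + h)

    for name, xs in (("MIPS", r3000), ("SPARC", sparc), ("MIPS -O", r3000_O)):
        lo, hi = ci95(xs)
        print("%-8s %6.2f  [%.1f, %.1f]" % (name, geomean(xs), lo, hi))
    # MIPS 12.63 [7.0, 23.5]; SPARC 14.36 [10.6, 20.7]; MIPS -O 25.80 [18.9, 36.4]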
Anyway, this is why the committee carefully said that the overall data
didn't mean very much.  Of course, the committee report came out AFTER
the JIAWG decision was made [i.e., it was irrelevant to that], and this
report explicitly did NOT recommend anything as the architecture for
military projects.

Lessons:
1) It's hard to evaluate things on paper.  I think the committee tried
hard, at a really difficult job, but it's real hard...
2) It's always a good idea to look behind the summaries a bit.
3) It's important to understand the difference between numbers that
mean something and numbers that don't.  The committee did understand
that there was insufficient difference to prove anything.

Now, everyone interprets data a bit differently.  Just for fun, let's
look at how Frank Yien and Scott Thorpe of Sun interpreted this, in
SunTech Journal, Autumn 1989, page ST8, in the article called:
"SPARC Scores In DARPA/SAE Architecture Test"

(THERE'S BEEN PLENTY OF DATA; NOW WE GET SOME "MARKETING" ANALYSIS;
QUIT NOW IF YOU DON'T LIKE THAT STUFF.  I INCLUDE THIS BECAUSE I'VE
ALREADY GOTTEN QUESTIONS FROM PEOPLE ABOUT IT, AND THE ARTICLE HAS
APPARENTLY BEEN GIVEN TO PEOPLE ABROAD TO PROVE THAT SPARC WAS SOMEHOW
A U.S.-RECOMMENDED STANDARD....)

The article leads off with:
"In a recent comparison of leading 32-bit architectures by DARPA (the
Defense Advanced Research Projects Agency), the SPARC architecture was
ranked as the top processor architecture for use in military projects."

Well, it had the highest numbers, but they weren't significant, and the
committee said so.  Of course, it didn't matter much anyway, because
the key decisions were being made somewhere else, and the choices
elsewhere [MIPS & Intel] reflected what the large contractors decided
in doing serious evaluations.

"Finally, SPARC won the benchmark category, without using the most
powerful SPARC implementations available from SPARC manufacturers
today.  The 80-MHz ECL SPARC implementation was not used in these
comparisons;"

Of course it wasn't; the embedded avionics market is not excited by
ECL, and Sun didn't have an ECL system for them to benchmark anyway.
So what does ECL SPARC have to do with it?

"instead, the 40-MHz CMOS SPARC implementation was benchmarked and
still won easily, since the others have only 33-MHz chips."

They didn't benchmark a 40-MHz implementation; they benchmarked a
25-MHz one and then multiplied by 40/25.  Note that no 40-MHz SPARC
SYSTEM has yet been announced, much less delivered.  It didn't win
easily; it won barely at 25 MHz, and if they had reported the
correspondingly-optimized MIPS numbers, a 40-MHz SPARC (not yet
delivered in a system) is seen from the chart above to be SLOWER than a
25-MHz R3000 [slower on the average, and slower on 8 out of the 11
benchmarks, the only exceptions being eightqueens, finite2, and shell,
hardly the larger/more realistic tests].

"note that military benchmarks are very demanding and closely resemble
compute-intensive engineering/simulation environments."

Military benchmarks can be demanding, all right, but some of these are
very small: a few of the benchmarks are realistic, some are tiny, and
none have any real-time component that I could see.  If you believe
there's a correlation between these benchmarks and engineering ones,
that's good, because MIPS is faster.  If you don't believe there's much
correlation, that's fine too....

"SPARC is winning the technology battle: It is the frequency leader in
both CMOS and ECL technologies and ranks first in independent tests.
SPARC hardware and software vendors are well positioned for the
future."

Well, to each their own....  Note that the real war for the 32-bit RISC
embedded defense standard seems to have 2 winners, and SPARC wasn't one
of them....  It's possible that some people missed this, although it
sure made the defense magazines...
-- 
-john mashey	DISCLAIMER:
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086