Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!rutgers!apple!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: SPARC vs. MIPS on gcc
Keywords: SPARC, MIPS
Message-ID: <10436@winchester.mips.COM>
Date: 31 Dec 88 03:42:13 GMT
References: <82150@sun.uucp> <697@hscfvax.harvard.edu> <677@helios.toronto.edu> <3790@druhi.ATT.COM>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 320

Well, I've been over in the Far East for a couple weeks and then out with the
holidays, so I'm only just finally getting dug out enough to read the news and
see I've missed all the fun.
Thanx to Ed Kelly for starting this, raising some issues, and posting some
actual numbers so that people can do some analysis.  Some of the
comments I might have made have already been made by people quicker with the
keyboards.  This also stirred me up to complete some relevant discussion
that I've been working on for a while.

Here's the outline:

PART 1: Necessary Benchmarking Background (mashey)
1.1 BENCHMARKING PERFORMANCE DISTRIBUTIONS
	CASE 0: (generic case)
	CASE 1: (R(i) = Geom(B) for all i)
	CASE 2: Modest variation
	CASE 3: (even more variation)
	CASE 4: wild variation
1.2 HARDWARE FACTORS THAT AFFECT VARIATIONS
1.3 SPEC
1.3 IMPLICATIONS

PART 2A: Some quick notes on duplicating the benchmark (killian)
PART 2B: Detailed Analysis of the GCC Benchmark (mashey)

(to follow: Part 1 is already long; too many of the relevant people have been
away, and we had a little trouble duplicating the numbers; look for it
in about a week; it's about the same size as this one.)

---------
1.1 BENCHMARKING PERFORMANCE DISTRIBUTIONS

Suppose you run N benchmarks on two machines, A and B.  For benchmark i, let
T(i,x) be the time to run on x, and compute the performance ratio R(i) for
benchmark i of T(i,A)/T(i,B).  [If A happened to be a VAX-11/780 under VMS, this
number would be the VUP (VAX Unit of Performance) number for the specific
benchmark.]
Now, renumber the benchmarks in order of increasing R(i) and graph the result,
on a scale where (of course) the performance of A is 1 on every benchmark,
and where Geom(x) = geometric mean of the performance ratios for machine x.
(Why geometric mean?  Arithmetic mean is wrong for ratios; one could argue
about harmonic...)  Here are some cases you might see:

CASE 0: (generic case)
Big R 	|
	|						B	B
	|					B
Geom(B)	|.	.	.	.	B	.	.	.
	|			B
	|	B	B
Small R	|B
	 1	2	3	4	5	6	7...	N

Just to be clear, this means B underperforms Geom(B) on 1-4 and outperforms
it on 6...N.

People who've seen graphs like this sometimes ask why we usually sort it
to get one or more of the machines monotonic or close.  Nothing magic: we've
tried various ways, and this has the least visual confusion, especially
when graphing 3-5 machines together on one chart.  Of course, it's not generally
possible to get ALL machines in such a chart monotonic, especially for
unrelated machines; nevertheless, people seem to see the graphs easier
if some sorting is done to remove the jerkiness.  Finally, it is
sometimes easier to see patterns when this is done.

Of course, I didn't say where Geom(A) (= 1.0 by definition) was on the chart:
it could be < Geom(B), in which case you'd claim that A was slower than
B on this set of benchmarks, it could be ==, it could be >.

Now, one can observe some interesting cases, depending on the benchmarks
selected:

CASE 1: (R(i) = Geom(B) for all i)
R 	|
Geom(B)	|B	B	B	B	B	B	B	B
	|					
Geom(A)	|.	.	.	.	.	.	.	. (1.0)
	|
	|
	 1	2	3	4	5	6	7...	N

In this case, B is uniformly faster than A.  The only realistic circumstances
I've ever seen this close to happening is where you're doing CPU benchmarks
of two machines that use the same CPU, same software, same memory system,
and only differ in clock rate THAT SCALES UNIFORMLY THROUGHOUT THE CPU+MEMORY
SYSTEM.  Needless to say, few peripherals work this way, so I/O tests of
two systems like this don't scale the same way.  In addition, on some
tests you might get surprised by: timer interrupts (the faster
machine has less of them to deal with if the timer isn't scaled up, which
usually is not done), or DRAM refresh overhead (maybe).  That means most
graphs look more like CASE 0 after all, which means that it is now time to look
at the values on the vertical axis.

EXAMPLE 1: at MIPS, the closest case to this is the M/800(M/1000) pair,
which are essentially identical, except for the clock rates (12.5Mhz or 15MHz).

EXAMPLE 2: various PC families show this behavior, differing only by clock rate.

Consider the case where there is some variation, but many of ratios
cluster around the same value, probably close to the geometric mean:

CASE 2: Modest variation
R 	|
1.1G(B)	|						B	B
Geom(B)	|.	.	B	B	B	B	.	.
	|	B
.9G(B)	|B
	 1	2	3	4	5	6	7...	N

(Note, this doesn't mean that Geom(B) == 1, it means that the R values
cluster from .9*G(B) to 1.1*G(B).

You'd guess from this that the two machines are part of the same family,
with the same software, and moderate differences in clock-rate or
small details of implementation. (Note: major differences in clock-rate
will probably spread things further apart).

EXAMPLE 1:
VAX 8650:8700 comparison looks this way (I think), with differences in
pipelining and the switch from write-back cache (in 8650) to write-thru
(in 8700) sometimes making noticeable differences in either direction. (Compare
differences in Whetstone & Linpack between them, for example).  I think the
two machines are about the same speed; maybe someone from DEC will comment.

EXAMPLE 2: 386-based PCs sometimes show this behavior more than do 286-based
ones, as the former more often use different sorts of memory systems.

EXAMPLE 3: the MIPS M/1000 versus M/120 is sort of this way:
	a) Both use 64KI + 64KD caches, 4-deep write-buffers, 1-word
	cache-refill.
	b) The clock rate changed from 15MHz to 16.7MHz, which by itself
	would not spread the ratios.
	c) The M/120 has a lower-latency memory system, i.e., so that it
	survives high-cache-miss rate programs better.  Thus, programs
	with low cache-miss rates will tend towards the left of the chart,
	where a 120 gets only the clock-rate difference (16.7/15);
	higher cache-miss rate programs will tend towards the right side.

CASE 3: (even more variation)
R 	|
1.5*G(B)|						B	B
	|					B
Geom(B)	|.	.	.	.	B	.	.	.
	|			B
	|	B	B
.75*G(B)|B
	 1	2	3	4	5	6	7...	N

This says that there is a 2X variation (1.5/.75) in the relative
performance of the two machines.  This is not at all atypical of a
randomly-selected pair of real machines.  As shown by DEC (McInnis,
Kusik, and Bhandarkar, "VAX 8800 System Overview", IEEE CH2409-01/87)
a VAX 8700 was anywhere from 3X to 7X faster
than an 11/780, even with the same software.   Most of the MIPS systems
versus VAXen have a similar-looking chart, as a gross first-order
approximation.   Some of the more extreme MIPS pairwise combinations
get up around a 1.3X variation (for example, M/2000 versus M/1000):
	if a benchmark has a low cache miss rate, the ratio is close to
		the clock-rate difference.
	if a benchmark has a high data cache miss rate, and block-fetch
		works, the 2000 is better than the 1000 by more than the
		clock rate.
	if a benchmark has a high data cache miss rate, and block-fetch
		DOESN'T work (compress is the notorious example), then
		a 2000 is not as much better as the clock-rate difference.
		(Compress is notorious because it hashes data into a huge
		sparse array & 1-word-refilled data caches are BETTER
		than N-word-refilled caches, which is not often true.)


CASE 4: wild variation
100G(B)	|							B
	|						B
	|					B
Geom(B)	|.	.	.	.	B	.	.	.
	|			B
	|	B	B
.5G(B)	|B
	 1	2	3	4	5	6	7...	N

This is what you typically would see when comparing a vector machine (B)
with a scalar machine, or maybe two vector machines optimized towards
shorter or longer vector lengths, or multiprocessors of various kinds,
or....  There is of course, nothing inherently bad about such variation,
except that the MORE VARIATION THERE IS, THE MORE YOU'D BETTER BE CAREFUL
ABOUT YOUR OWN WORKLOAD AND UNDERSTANDING WHICH BENCHMARKS, IF ANY, ARE
TRULY REPRESENTATIVE OF IT.

1.2 HARDWARE FACTORS THAT AFFECT THE DISTRIBUTIONS IN CPU PERFORMANCE:

1) Cached versus non-cached systems
The easiest machines to compare are simple non-cached ones with simple
memory systems, because the graphs will tend to look like CASE 2.
In particular, a few integer benchmarks, and a few FP ones will quickly
give you some idea of what is happening.  Unfortunately, this domain is
basically limited to the slower microprocessor designs, as most others
either use caches, or if not, not may be vector machines with memory systems
optimized for vector transfers.

2) If cached:
	size of cache
	joint versus split I & D
	level of associativity
	write-thru versus write-back
	linesize
	block-transfer size
	etc, etc.
	Size is one of easiest ones to get surprised with, especially
	on scientific benchmarks with varying array sizes.
	(You can REALLY get misled if you happen to pick a benchmark where
	the particular size happens to fit into machine A's cache, but not
	quite into machine B's.  You can get odd effects where B performs
	relatively better on small problems and big problems, but A does
	better on middle-sized problems.)  This is becoming especially
	relevant as caches grow in size to consume popular benchmarks,
	i.e., the typical 100x100 Linpack is noticeably helped by current
	caches! (I think this is why Dongarra & friends are emphasizing
	plotting of MFLOPS rates over many array sizes to avoid weird
	single-point effects.)

3) Memory system
	random-access-equal versus random-access-unequal variations,
	such as when using page-mode DRAMS, SCRAM-cache (as in Sun-4/110),
	banking schemes, etc, where any of the following might occur:
		2nd access to same page is faster than random
		2nd access is faster if not to the same bank

4) Vector versus scalar system design
	This is one of the most common causes of large variations in
	performance.

5) Multi-processor versus uniprocessor

6) Memory-management design.
	Small programs; big programs; sparse versus dense data, etc, etc.

ALL OF THE ABOVE CAN AFFECT SINGLE-THREAD PERFORMANCE ON CPU COMPUTATIONAL
PERFORMANCE.  Finally, along with other design elements (like exception-
	processing), they can affect other performance attributes, which
	this analysis has no pretense of doing anything but listing:
	multi-user/server (versus workstation) performance
	big jobs versus little ones
	user code versus kernel code (they're different)
	commercial versus technical applications
	balance across different applications versus tuned for a few

1.2 SPEC
Most high-speed machines OF COURSE use combinations of the things that
will cause more variations.  This is one of the reasons that {Apollo,
HP, MIPS, Sun} have started SPEC: a lot of the simpler benchmarks used in
the micro world just don't cope very well with the inherent variability
of current high-performance machines; we need more good benchmarks that
cover a wider range of applications.  It is not that this is a new problem,
of course, it's just that in the last few years, fast machines are getting
very cheap, and so the complexity issues found in mainframes/supercomputers,
and then superminis, are migrating into cheap machines, and hence are
more visible to more people.

1.3 IMPLICATIONS
1) If you have N benchmarks, and you select M << N of them, you can usually
prove almost anything about the relative performance of A & B.
For example, there exist benchmarks that can drag almost any machine
down to DRAM random-access speed, no matter what its cache architecture is.
Of course, different machines drag down different architectures worse.
Some of the nasties are even real programs (like compress, for example).
Note that this implies that a GOOD benchmark suite would:
	a) Avoid tiny programs that fit into trivial caches.
	b) Include some nasties that break everybody's caches (at least
	this year; next year's caches are going to be harder to fill!)
	c) Include some that are real programs, but may fit into
	some caches. [You can't hope to use only b), because caches are
	getting bigger fast, and many real programs are significantly
	helped by 1988/1989 microprocessor caches.]
	d) When possible, have benchmarks whose data sizes are easily
	variable, so that you can plot multiple points, avoiding surprises
	from happenstances of sizing.  This is probably easiest with
	scientific programs; sometimes it's just about impossible with
	some systems programs.  Note that the point is not to make a benchmark
	run long enough to be interesting [that's a separate issue], but to
	vary the sizes to better analyze memory-system effects.

2) As usual, your own application is the best benchmark.  If you don't have 
that, your best bet is to hope to find that some of the benchmarks are
ones that you've found usually correlate with your own applications.

3) If you are able to make N fairly large, you may be able to find
patterns, like "good integer performance, bad floating-point",
"good vector floating-point, bad integer", "good on small benchmarks,
bad on big ones", etc.  The most common patterns are:
	a:Balanced integer and scalar floating-point (superminis, mainframes)
	b:Integer noticeably better than FP (most microprocessors)
	c:FP better than integer, and vector FP much better than scalar VP,
	relative to the mainframe/supermini pattern (most supercomputers and
	mini-supers, although not all.)

4) Most of this was about CPU performance.  When you add other realistic
systems benchmarks, it only gets more complicated, although some of the
same syndromes show up [like testing disk I/O with repeated access to
files that do / do not fit into UNIX buffer caches.]

In PART 2, we'll look at the gcc benchmark described by Ed Kelly in
<82150@sun.uucp>.  Although the test case shown doesn't run as long as one would
like, using gcc as a benchmark is a reasonable and worthy thing, as it:
	a. is a large integer program
	b. is widely available in source form, if not public domain
	c. is an example of something that people really use
and hence, it is probably a good candidate for inclusion in good benchmark
suites, especially as it is excruciatingly hard to get compilers that obey b.

It is also a good example to analyze in detail:
	It is typical of other integer programs in some ways
	It is somewhat atypical of them, in other ways
	It offers a good example of some of the specific cautions mentioned
		above in data interpretation, especially as applied to
		fast machines with differing memory system designs
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086