Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!rutgers!apple!vsi1!wyse!mips!mash From: mash@mips.COM (John Mashey) Newsgroups: comp.arch Subject: Re: SPARC vs. MIPS on gcc Keywords: SPARC, MIPS Message-ID: <10436@winchester.mips.COM> Date: 31 Dec 88 03:42:13 GMT References: <82150@sun.uucp> <697@hscfvax.harvard.edu> <677@helios.toronto.edu> <3790@druhi.ATT.COM> Reply-To: mash@mips.COM (John Mashey) Organization: MIPS Computer Systems, Sunnyvale, CA Lines: 320 Well, I've been over in the Far East for a couple weeks and then out with the holidays, so I'm only just finally getting dug out enough to read the news and see I've missed all the fun. Thanx to Ed Kelly for starting this, raising some issues, and posting some actual numbers so that people can do some analysis. Some of the comments I might have made have already been made by people quicker with the keyboards. This also stirred me up to complete some relevant discussion that I've been working on for a while. Here's the outline: PART 1: Necessary Benchmarking Background (mashey) 1.1 BENCHMARKING PERFORMANCE DISTRIBUTIONS CASE 0: (generic case) CASE 1: (R(i) = Geom(B) for all i) CASE 2: Modest variation CASE 3: (even more variation) CASE 4: wild variation 1.2 HARDWARE FACTORS THAT AFFECT VARIATIONS 1.3 SPEC 1.3 IMPLICATIONS PART 2A: Some quick notes on duplicating the benchmark (killian) PART 2B: Detailed Analysis of the GCC Benchmark (mashey) (to follow: Part 1 is already long; too many of the relevant people have been away, and we had a little trouble duplicating the numbers; look for it in about a week; it's about the same size as this one.) --------- 1.1 BENCHMARKING PERFORMANCE DISTRIBUTIONS Suppose you run N benchmarks on two machines, A and B. For benchmark i, let T(i,x) be the time to run on x, and compute the performance ratio R(i) for benchmark i of T(i,A)/T(i,B). [If A happened to be a VAX-11/780 under VMS, this number would be the VUP (VAX Unit of Performance) number for the specific benchmark.] Now, renumber the benchmarks in order of increasing R(i) and graph the result, on a scale where (of course) the performance of A is 1 on every benchmark, and where Geom(x) = geometric mean of the performance ratios for machine x. (Why geometric mean? Arithmetic mean is wrong for ratios; one could argue about harmonic...) Here are some cases you might see: CASE 0: (generic case) Big R | | B B | B Geom(B) |. . . . B . . . | B | B B Small R |B 1 2 3 4 5 6 7... N Just to be clear, this means B underperforms Geom(B) on 1-4 and outperforms it on 6...N. People who've seen graphs like this sometimes ask why we usually sort it to get one or more of the machines monotonic or close. Nothing magic: we've tried various ways, and this has the least visual confusion, especially when graphing 3-5 machines together on one chart. Of course, it's not generally possible to get ALL machines in such a chart monotonic, especially for unrelated machines; nevertheless, people seem to see the graphs easier if some sorting is done to remove the jerkiness. Finally, it is sometimes easier to see patterns when this is done. Of course, I didn't say where Geom(A) (= 1.0 by definition) was on the chart: it could be < Geom(B), in which case you'd claim that A was slower than B on this set of benchmarks, it could be ==, it could be >. Now, one can observe some interesting cases, depending on the benchmarks selected: CASE 1: (R(i) = Geom(B) for all i) R | Geom(B) |B B B B B B B B | Geom(A) |. . . . . . . . (1.0) | | 1 2 3 4 5 6 7... N In this case, B is uniformly faster than A. The only realistic circumstances I've ever seen this close to happening is where you're doing CPU benchmarks of two machines that use the same CPU, same software, same memory system, and only differ in clock rate THAT SCALES UNIFORMLY THROUGHOUT THE CPU+MEMORY SYSTEM. Needless to say, few peripherals work this way, so I/O tests of two systems like this don't scale the same way. In addition, on some tests you might get surprised by: timer interrupts (the faster machine has less of them to deal with if the timer isn't scaled up, which usually is not done), or DRAM refresh overhead (maybe). That means most graphs look more like CASE 0 after all, which means that it is now time to look at the values on the vertical axis. EXAMPLE 1: at MIPS, the closest case to this is the M/800(M/1000) pair, which are essentially identical, except for the clock rates (12.5Mhz or 15MHz). EXAMPLE 2: various PC families show this behavior, differing only by clock rate. Consider the case where there is some variation, but many of ratios cluster around the same value, probably close to the geometric mean: CASE 2: Modest variation R | 1.1G(B) | B B Geom(B) |. . B B B B . . | B .9G(B) |B 1 2 3 4 5 6 7... N (Note, this doesn't mean that Geom(B) == 1, it means that the R values cluster from .9*G(B) to 1.1*G(B). You'd guess from this that the two machines are part of the same family, with the same software, and moderate differences in clock-rate or small details of implementation. (Note: major differences in clock-rate will probably spread things further apart). EXAMPLE 1: VAX 8650:8700 comparison looks this way (I think), with differences in pipelining and the switch from write-back cache (in 8650) to write-thru (in 8700) sometimes making noticeable differences in either direction. (Compare differences in Whetstone & Linpack between them, for example). I think the two machines are about the same speed; maybe someone from DEC will comment. EXAMPLE 2: 386-based PCs sometimes show this behavior more than do 286-based ones, as the former more often use different sorts of memory systems. EXAMPLE 3: the MIPS M/1000 versus M/120 is sort of this way: a) Both use 64KI + 64KD caches, 4-deep write-buffers, 1-word cache-refill. b) The clock rate changed from 15MHz to 16.7MHz, which by itself would not spread the ratios. c) The M/120 has a lower-latency memory system, i.e., so that it survives high-cache-miss rate programs better. Thus, programs with low cache-miss rates will tend towards the left of the chart, where a 120 gets only the clock-rate difference (16.7/15); higher cache-miss rate programs will tend towards the right side. CASE 3: (even more variation) R | 1.5*G(B)| B B | B Geom(B) |. . . . B . . . | B | B B .75*G(B)|B 1 2 3 4 5 6 7... N This says that there is a 2X variation (1.5/.75) in the relative performance of the two machines. This is not at all atypical of a randomly-selected pair of real machines. As shown by DEC (McInnis, Kusik, and Bhandarkar, "VAX 8800 System Overview", IEEE CH2409-01/87) a VAX 8700 was anywhere from 3X to 7X faster than an 11/780, even with the same software. Most of the MIPS systems versus VAXen have a similar-looking chart, as a gross first-order approximation. Some of the more extreme MIPS pairwise combinations get up around a 1.3X variation (for example, M/2000 versus M/1000): if a benchmark has a low cache miss rate, the ratio is close to the clock-rate difference. if a benchmark has a high data cache miss rate, and block-fetch works, the 2000 is better than the 1000 by more than the clock rate. if a benchmark has a high data cache miss rate, and block-fetch DOESN'T work (compress is the notorious example), then a 2000 is not as much better as the clock-rate difference. (Compress is notorious because it hashes data into a huge sparse array & 1-word-refilled data caches are BETTER than N-word-refilled caches, which is not often true.) CASE 4: wild variation 100G(B) | B | B | B Geom(B) |. . . . B . . . | B | B B .5G(B) |B 1 2 3 4 5 6 7... N This is what you typically would see when comparing a vector machine (B) with a scalar machine, or maybe two vector machines optimized towards shorter or longer vector lengths, or multiprocessors of various kinds, or.... There is of course, nothing inherently bad about such variation, except that the MORE VARIATION THERE IS, THE MORE YOU'D BETTER BE CAREFUL ABOUT YOUR OWN WORKLOAD AND UNDERSTANDING WHICH BENCHMARKS, IF ANY, ARE TRULY REPRESENTATIVE OF IT. 1.2 HARDWARE FACTORS THAT AFFECT THE DISTRIBUTIONS IN CPU PERFORMANCE: 1) Cached versus non-cached systems The easiest machines to compare are simple non-cached ones with simple memory systems, because the graphs will tend to look like CASE 2. In particular, a few integer benchmarks, and a few FP ones will quickly give you some idea of what is happening. Unfortunately, this domain is basically limited to the slower microprocessor designs, as most others either use caches, or if not, not may be vector machines with memory systems optimized for vector transfers. 2) If cached: size of cache joint versus split I & D level of associativity write-thru versus write-back linesize block-transfer size etc, etc. Size is one of easiest ones to get surprised with, especially on scientific benchmarks with varying array sizes. (You can REALLY get misled if you happen to pick a benchmark where the particular size happens to fit into machine A's cache, but not quite into machine B's. You can get odd effects where B performs relatively better on small problems and big problems, but A does better on middle-sized problems.) This is becoming especially relevant as caches grow in size to consume popular benchmarks, i.e., the typical 100x100 Linpack is noticeably helped by current caches! (I think this is why Dongarra & friends are emphasizing plotting of MFLOPS rates over many array sizes to avoid weird single-point effects.) 3) Memory system random-access-equal versus random-access-unequal variations, such as when using page-mode DRAMS, SCRAM-cache (as in Sun-4/110), banking schemes, etc, where any of the following might occur: 2nd access to same page is faster than random 2nd access is faster if not to the same bank 4) Vector versus scalar system design This is one of the most common causes of large variations in performance. 5) Multi-processor versus uniprocessor 6) Memory-management design. Small programs; big programs; sparse versus dense data, etc, etc. ALL OF THE ABOVE CAN AFFECT SINGLE-THREAD PERFORMANCE ON CPU COMPUTATIONAL PERFORMANCE. Finally, along with other design elements (like exception- processing), they can affect other performance attributes, which this analysis has no pretense of doing anything but listing: multi-user/server (versus workstation) performance big jobs versus little ones user code versus kernel code (they're different) commercial versus technical applications balance across different applications versus tuned for a few 1.2 SPEC Most high-speed machines OF COURSE use combinations of the things that will cause more variations. This is one of the reasons that {Apollo, HP, MIPS, Sun} have started SPEC: a lot of the simpler benchmarks used in the micro world just don't cope very well with the inherent variability of current high-performance machines; we need more good benchmarks that cover a wider range of applications. It is not that this is a new problem, of course, it's just that in the last few years, fast machines are getting very cheap, and so the complexity issues found in mainframes/supercomputers, and then superminis, are migrating into cheap machines, and hence are more visible to more people. 1.3 IMPLICATIONS 1) If you have N benchmarks, and you select M << N of them, you can usually prove almost anything about the relative performance of A & B. For example, there exist benchmarks that can drag almost any machine down to DRAM random-access speed, no matter what its cache architecture is. Of course, different machines drag down different architectures worse. Some of the nasties are even real programs (like compress, for example). Note that this implies that a GOOD benchmark suite would: a) Avoid tiny programs that fit into trivial caches. b) Include some nasties that break everybody's caches (at least this year; next year's caches are going to be harder to fill!) c) Include some that are real programs, but may fit into some caches. [You can't hope to use only b), because caches are getting bigger fast, and many real programs are significantly helped by 1988/1989 microprocessor caches.] d) When possible, have benchmarks whose data sizes are easily variable, so that you can plot multiple points, avoiding surprises from happenstances of sizing. This is probably easiest with scientific programs; sometimes it's just about impossible with some systems programs. Note that the point is not to make a benchmark run long enough to be interesting [that's a separate issue], but to vary the sizes to better analyze memory-system effects. 2) As usual, your own application is the best benchmark. If you don't have that, your best bet is to hope to find that some of the benchmarks are ones that you've found usually correlate with your own applications. 3) If you are able to make N fairly large, you may be able to find patterns, like "good integer performance, bad floating-point", "good vector floating-point, bad integer", "good on small benchmarks, bad on big ones", etc. The most common patterns are: a:Balanced integer and scalar floating-point (superminis, mainframes) b:Integer noticeably better than FP (most microprocessors) c:FP better than integer, and vector FP much better than scalar VP, relative to the mainframe/supermini pattern (most supercomputers and mini-supers, although not all.) 4) Most of this was about CPU performance. When you add other realistic systems benchmarks, it only gets more complicated, although some of the same syndromes show up [like testing disk I/O with repeated access to files that do / do not fit into UNIX buffer caches.] In PART 2, we'll look at the gcc benchmark described by Ed Kelly in <82150@sun.uucp>. Although the test case shown doesn't run as long as one would like, using gcc as a benchmark is a reasonable and worthy thing, as it: a. is a large integer program b. is widely available in source form, if not public domain c. is an example of something that people really use and hence, it is probably a good candidate for inclusion in good benchmark suites, especially as it is excruciatingly hard to get compilers that obey b. It is also a good example to analyze in detail: It is typical of other integer programs in some ways It is somewhat atypical of them, in other ways It offers a good example of some of the specific cautions mentioned above in data interpretation, especially as applied to fast machines with differing memory system designs -- -john mashey DISCLAIMER: UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086