Path: utzoo!attcan!uunet!lll-winken!lll-lcc!ames!vsi1!daver!mips!earl@wright.mips.com
From: earl@wright.mips.com (Earl Killian)
Newsgroups: comp.arch
Subject: Re: SPARC vs. MIPS on gcc
Keywords: SPARC, MIPS
Message-ID: <10574@wright.mips.COM>
Date: 3 Jan 89 17:06:28 GMT
References: <82150@sun.uucp> <697@hscfvax.harvard.edu> <677@helios.toronto.edu> <3790@druhi.ATT.COM> <10436@winchester.mips.COM>
Sender: earl@mips.COM
Organization: MIPS Computer Systems, Sunnyvale CA
Lines: 214

Ed Kelly of Sun studied gcc compiling gcc.c on MIPS and SPARC, and posted some statistics together with his analysis and conclusions. I decided to take a look myself (also, it's a likely SPEC benchmark, so understanding it will be useful).

At first I was unable to duplicate Kelly's statistics. gcc compiled on MIPS with cc -O3 and ran without a hitch, whereas Kelly said -O3 didn't work (-O4 also works if you fix a trivial bug in the gcc source). Subsequently we were told that Sun's -O3 problem was that it ran out of space in /tmp on their machine and not a compiler bug.

With -O3 I get 17.40M instructions. At -O2, I get 17.82M instructions instead of his 18.64M, so there was a big difference to explain. The major difference between -O2 and -O3 is inter-procedural register allocation. A minor difference is that -O2 by default declines to optimize "big" procedures (> 500 basic blocks) to save on compilation time during program development. It warns you by saying

    uopt: Warning: expand_expr: this procedure not optimized because it
    exceeds size threshold; to optimize this procedure, use -Olimit option
    with value >= 656.

For benchmarking, I go back and add a -Olimit to the Makefile and recompile, just as the warning suggests. If I leave off the -Olimit then several procedures remain unoptimized and the result is 18.25M instructions. Closer to Kelly's result, but still not there.
(Note that two of the unoptimized procedures are yyparse and yylex, which are the 2nd and 3rd heaviest contributors to CPU cycles...)

Kelly was running this benchmark on a System V M/1000 as opposed to a BSD M/1000 (MIPS sells both flavors of Unix). When I tried it on System V I got link errors for BSD-only routines such as bcopy and bzero, which I solved by adding -lbsd to the command line. My guess is that Kelly didn't know about -lbsd and chose to use straightforward byte-at-a-time bcopy/bzero substitutes. When I try that I get 18.68M instructions, which is quite close to his result.

In summary:

    18.68M  -O2, no opt of yyparse, yylex etc., no use of library bcopy/bzero
    18.64M  posted number
    18.25M  -O2, no opt of yyparse, yylex, etc.
    17.82M  -O2, optimize yyparse, yylex, etc.
    17.40M  -O3

(All results use the MIPS 1.31 compilers, which were released in mid-1988.)

The point of this was to show that Kelly's analysis was built on questionable statistics. But even with his statistics as a basis, some of his conclusions are unwarranted.

As many people pointed out, gcc is only one data point, and it is unreasonable to conclude anything from a single data point. There might be something anomalous in that one case, for example. One thing I learned in porting gcc is that the MIPS compiler generates poor code for a C construct that gcc uses heavily (a bit-field enum that is both aligned and 16 bits in length). Oh well, every compiler has some simple things it doesn't bother to special case. This will be fixed in a future compiler release. With that compiler the gcc instruction count on Kelly's input is 16.47M instructions at -O3 (about 6% fewer instructions). It is exactly this sort of sensitivity to small details that makes single data point conclusions unreliable.

It also turns out that 6% of the instruction cycles are spent in printf etc. I don't know whether the SPARC printf has been heavily tuned or not; ours has not.
It is fair to include the cost of this as a system test: that's what the user sees. However, it is hard to draw conclusions about Instruction Set Architecture (ISA) + Compilers, where one is concerned about a % here or there, when noticeable parts of the code are from libraries.

With those caveats in mind, let's look at some of Kelly's remarks:

"As will surprise most observers, SPARC executes fewer instructions than MIPS."

This doesn't surprise me when I look closer and see how the instruction counts differ. After all, the RISC vs. CISC wars were begun with the premise that instructions were only one term in the performance equation. Total performance is what matters.

As several people pointed out on the net, the difference in instruction counts is primarily attributable to MIPS using a NOP instruction instead of a hardware interlock for load instructions (shifting responsibility from hardware to software). With interlocking, the load NOPs would be replaced by a single-cycle stall, so the load NOPs have no direct performance impact (an indirect effect is that the increase in code size affects i-cache miss rates).

To compensate for the difference in interlocking approach (hardware vs. software), you can either subtract load nops (.91M) from the MIPS counts or add SPARC interlocks (1.47M) to the SPARC counts. With our 1.31 compilers, that makes the difference +-1% for adjusted instruction count. (With the compiler that optimizes aligned 16-bit bit-fields to halfwords, it is 5 to 8% in favor of MIPS.)

But again, instruction counts aren't a good basis for comparison. I don't think you can compare ISAs without looking at implementations. For example, MIPS has a divide instruction and SPARC has none. Should we add in our divide interlocks to be fair? But a hypothetical MIPS machine could have an 8-cycle divide, so maybe we ought to use 8, not 35, in ISA comparison? How can this work? In contrast, comparing cycles or time is more meaningful.
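To make the "cycles or time" comparison concrete, here is a minimal sketch in Python (not part of the original measurements). It assumes only figures quoted in this posting: the total cycle counts for the Sun 4/280 and the MIPS M/120, and the 16.7MHz clock that both machines share. The names `TOTAL_CYCLES_M`, `CLOCK_MHZ` and `run_time_seconds` are made up for illustration.

```python
# Total cycle counts in millions (CPU cycles plus cache-miss cycles), as
# quoted in this posting for gcc compiling gcc.c.
TOTAL_CYCLES_M = {
    "Sun 4/280": 29.95,
    "MIPS M/120": 24.19,
}

# Clock rate in MHz. The posting gives 16.7MHz for the M/120 and says the
# 4/280 has the same cycle time, so one value is assumed for both.
CLOCK_MHZ = 16.7


def run_time_seconds(cycles_millions, clock_mhz):
    """Millions of cycles divided by millions of cycles per second."""
    return cycles_millions / clock_mhz


# Run time per machine under the shared-clock assumption above.
times = {name: run_time_seconds(c, CLOCK_MHZ) for name, c in TOTAL_CYCLES_M.items()}

if __name__ == "__main__":
    for name, t in times.items():
        print(f"{name}: {t:.2f} s")
    print(f"ratio: {times['Sun 4/280'] / times['MIPS M/120']:.2f}")
```

Because the clock is the same on both machines, the ratio of run times is simply the ratio of cycle counts, roughly 1.24 with these figures. With different clocks the division by clock rate is what turns cycle counts into a single comparable number.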

Kelly gives 25.52M as the raw cpu cycle count. The corresponding MIPS number (1.31 compilers) is 17.74M. The large difference is of course due to the Fujitsu SPARC chip using one extra cycle on loads, 2 extra cycles on stores, and one extra cycle on untaken branches.

To go beyond cpu performance we need to pick a memory system. This is probably a good place to point out that the M/1000 Kelly used is a lower performance machine than anything we now sell; it has been essentially obsoleted by the 16.7MHz M/120 (like the M/1000, based on the R2000) and the 25MHz M/2000 (based on the R3000), both of which are in production and shipping.

Adding in cache miss cycles, Kelly gives a total of 29.95M cycles for the Sun 4/280. For the MIPS M/120 I get 24.19M (27.80M for the M/1000). Since the cycle time is the same for both the 4/280 and the M/120, the cycle counts are directly related to time. I don't think there's much to squabble about here. Time is time. All the trade-offs have been reduced to a single number.

Kelly might object that a hypothetical SPARC implementation might avoid the extra load/store/branch cycles. Such an implementation is said to be in progress. When it's appropriate, why not use it for comparison with the corresponding MIPS system?

"For many observers the interesting fact is that for this benchmark, the MIPS compiler is not significantly better than the current SPARC compiler. Considering the bad press, I will admit I was surprised by this myself."

This statement was unsubstantiated; it is not obvious to me how to compare compilers based on instruction statistics from different architectures, especially on only one benchmark. The few things that do come to mind suggest that the MIPS compiler is doing a better job, but given the importance of library code in this benchmark, the whole subject is on thin ice. Perhaps Kelly can elaborate?

"Being a SPARC advocate I would claim that SPARC is ARCHITECTURALLY fundamentally better, but the degree of difference is probably in the noise in the broader scheme of things."

(-: Gee, being a MIPS advocate, and given the corrected numbers, should I claim that MIPS ISA is 5-8% fundamentally better? :-)

Kelly moves on to discuss the architecture of the entire system, not just the ISA. I have some quibbles with his methodology (e.g. inferring anything from Unix runtimes on the order of 1-2 seconds, where the error per measurement is probably 10% or more), but I really have to restrict myself to addressing a few of his off-hand remarks (this posting is already too long).

"These numbers represent significant differences in the IMPLEMENTATION philosophies at Sun and at MIPS. The central goal at MIPS appears to have been to achieve a single cycle per instruction, even at the cost of cycle time and complexity. Clearly that was not a central goal at Sun."

Certainly single cycle execution was one of several MIPS goals, but I would not say it was at the expense of cycle time or complexity at all. The most significant pressure on cycle time in the R2000 is due to physical instead of virtual caches, not single-cycle execution. Virtual caches simplify the CPU at the expense of multi-programming performance and multi-processing implementation complexity.

"Our goals were dominated by cycle time and system simplicity. Performance on large programs was our design metric. The first SPARC implementation achieved a faster cycle time than the best of MIP's first implementations, despite inferior technology."

This is not true. Both the Fujitsu SPARC and the R2000 are 16.7MHz chips. The M/1000 system, based on the R2000, was 15MHz instead of 16.7MHz because it used memory boards from the M/500 generation system (you could upgrade with a cpu board replacement), and those memory boards are good to 15MHz. (The M/500 was introduced 18 months before the Sun 4/260.)
Both MIPS and its customers ship systems based on the R2000 at 16.7MHz (the M/1000 just isn't one of them).

Is the Fujitsu SPARC implemented in an inferior technology to the R2000? That's hard to call. The Fujitsu SPARC is implemented in what is, I think, a 1.5 micron CMOS gate array technology whereas the R2000 is implemented in 2.0 micron custom CMOS technology. I'm not sure how to compare these particular apples and oranges.

"The MIPS performance brief has concentrated on relatively small integer programs that fit in the cache and so benefit well from the single cycle loads and stores."

The MIPS performance brief concentrates on large programs. It is the case that the large programs are floating point; large public domain floating point programs are easier to find than large public domain integer programs. The UNIX commands listed in the Brief are at least reasonably-sized real programs, not toys, and they're what a lot of people use.

What about the Sun performance brief? It relies on the dhrystone and stanford benchmarks, which are much smaller than the MIPS Unix suite.

"This overstates the integer performance for large programs, which are after all what people buy fast machines to run. MIPS implicitly acknowledges this by calling the M1000 a 10 MIP box despite the fact that all the published data in the MIPS performance brief would say integer performance is greater than 12 MIPs."

Unlike Sun, but like DEC, we consider both floating point and integer performance when assigning a VUPS (sometimes called MIPS) rating to our machines. And yes, we don't use toys like dhrystone and stanford for our ratings (we give results because they're popular). Read section 2.1 of the MIPS performance brief for details. Is there something wrong with basing ratings on large, real programs?
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086