Path: utzoo!attcan!uunet!seismo!sundc!pitstop!sun!aisling!edkelly
From: edkelly%aisling@Sun.COM (Ed Kelly)
Newsgroups: comp.arch
Subject: SPARC vs. MIPS on gcc
Keywords: SPARC, MIPS
Message-ID: <82150@sun.uucp>
Date: 17 Dec 88 01:31:48 GMT
Sender: news@sun.uucp
Reply-To: edkelly@sun.UUCP (Ed Kelly)
Organization: Sun Microsystems, Mountain View
Lines: 306

A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.

For the comparison we chose a large portable C program (the GNU C
Compiler rev 1.24) and compiled the identical source on a Sun-4/280
with the SPARC compiler to produce a SPARC binary, and on a MIPS M/1000
with the MIPS compiler to produce a MIPS binary. Then using the same
data (the file gcc.c) we ran the benchmark on both machines and
gathered the dynamic trace statistics provided by SPIXSTATS and PIXIE,
the respective statistics gathering programs for SPARC and MIPS. We
also measured the user and system time on both machines.

The compiler optimization level was set at -O2 for MIPS (the highest
that would compile) and at -O4 for SPARC. Both compilers were the
standard production versions as of Sept 1988. MIPS -O2 and SPARC -O4
are comparable levels of optimization. -O3 was the highest MIPS
optimization available. From data on other C programs -O3 produces a
gain of less than 2% on average over -O2, so we feel the comparison is
valid.

The following is divided into two sections. The first section covers a
SPARC vs. MIPS instruction set architecture comparison and the second
is an implementation comparison of the Sun-4/260 vs. the M/1000.

The architecture comparison counts INSTRUCTIONS and is useful for
comparing instruction sets and compiler efficiency. This will not vary
across implementations if compilers are a constant. If you are
interested in architecture and wish to avoid the confusion of
implementation details these are the numbers of most interest.

The implementation comparison counts CYCLES and includes effects like
multi-cycle loads and cache misses etc.
__________________________________________________________________________
INSTRUCTION SET/REGISTER ARCHITECTURE AND COMPILER COMPARISON
__________________________________________________________________________
                              SPARC                MIPS         MIPS-SPARC
Total Instructions       16,313,907          18,635,185         +2,321,278
--------------------------------------------------------------------------
Detailed Breakdown
--------------------------------------------------------------------------
Branch nops                 109,079           1,170,306
Load nops                        na           1,113,019
Jump nops                   102,417             211,409
other nops                   20,110              99,495
annulled delay slots      (634,700)                  na
load interlock cycles   (1,474,619)                  na
--------------------------------------------------------------------------
nops sub-total              231,606 (1.4%)    2,594,229 (14%)   +2,362,623

loads                     3,242,293 (19.9%)   3,928,710 (21%)     +686,417
stores                    1,175,530 (7.2%)    2,037,266 (10.9%)   +861,736

conditional branches      2,699,885           2,559,648
unconditional branches      225,739             190,456
jumps                       326,578             498,865
calls                       214,662             213,118
--------------------------------------------------------------------------
jmp/branch sub-total      3,466,864 (21%)     3,462,087 (18.5%)     -4,777

shift                       716,666             890,281
logical set cc              850,121
logical                   1,396,645           1,335,473
arithmetic set cc         1,914,842
arithmetic                1,853,659           3,241,789
set                              na             666,309
save/restore                337,820                  na
others                       84,094              41,838
--------------------------------------------------------------------------
computational sub-total   7,192,084 (44%)     6,175,690 (33%)   -1,016,394

sethi/lui                 1,003,713 (6.15%)     437,207 (2.3%)    -566,506
--------------------------------------------------------------------------

Some notes on the categories.

MIPS "set" could be categorized as arithmetic or arithmetic set cc.

SPARC "save/restore" could be categorized as arithmetic. They adjust
the stack pointer and increment/decrement the window pointer; the
equivalent MIPS operation is to adjust the stack pointer.

The SPARC nops listed as "other" are mostly associated with calls.

The "others" category is mostly multiply related.

As will surprise most observers, SPARC executes fewer instructions than
MIPS.
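The nops and jmp/branch sub-totals in the table can be rechecked
mechanically. What follows is only a minimal sketch in Python, not part
of the original measurements: the counts are copied from the breakdown
above, and the names NOPS, JMP_BRANCH, TOTAL, subtotal and share are
made up for illustration.

```python
# Dynamic instruction counts copied from the breakdown table, as
# (SPARC, MIPS) pairs. None marks a cell listed as "na".
# All names in this sketch are illustrative assumptions.
NOPS = {
    "branch nops": (109_079, 1_170_306),
    "load nops": (None, 1_113_019),
    "jump nops": (102_417, 211_409),
    "other nops": (20_110, 99_495),
}
JMP_BRANCH = {
    "conditional branches": (2_699_885, 2_559_648),
    "unconditional branches": (225_739, 190_456),
    "jumps": (326_578, 498_865),
    "calls": (214_662, 213_118),
}
TOTAL = (16_313_907, 18_635_185)  # total instructions (SPARC, MIPS)
SPARC, MIPS = 0, 1                # column indices


def subtotal(rows, column):
    """Sum one machine's column, skipping "na" cells."""
    return sum(r[column] for r in rows.values() if r[column] is not None)


def share(count, column):
    """Count as a percentage of that machine's total instructions."""
    return 100.0 * count / TOTAL[column]


if __name__ == "__main__":
    for name, rows in (("nops", NOPS), ("jmp/branch", JMP_BRANCH)):
        s, m = subtotal(rows, SPARC), subtotal(rows, MIPS)
        print(f"{name}: SPARC {s:,} ({share(s, SPARC):.1f}%), "
              f"MIPS {m:,} ({share(m, MIPS):.1f}%), MIPS-SPARC {m - s:+,}")
```

The parenthesized SPARC entries (annulled delay slots and load
interlock cycles) are left out of NOPS on purpose, since the table does
not count them as instructions.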
Some specific observations.

1) SPARC has many fewer loads and stores (1,548,153 fewer), which
points out the significant ARCHITECTURAL advantage of register windows.
Stated another way, for this case MIPS has 35% more loads and stores
than SPARC. This benchmark contains more loads and stores than our
"average" case of 15% loads and 6% stores so the benefits of register
windows may actually be understated here.

2) There are lots of NOPs in MIPS code. This is an ARCHITECTURAL
feature. NOPs are not benign. As well as the direct cycles lost, lots
of NOPs are bad for code density, and they increase instruction cache
miss penalties (due to more memory accesses and greater probability of
a miss).

A subtle point about the NOPs is that they distort statistics presented
as percentages. MIPS's combined load/store percentage is 32% for this
benchmark. If there were no NOPs the percentage would be 37% vs.
SPARC's 27%.

Current SPARC implementations, however, incur a clock cycle penalty for
some of the cases where MIPS has to insert NOPs, so counting all NOPs
against MIPS overstates the situation. This includes the load-use
interlock case (1,474,619) and the untaken annulled branch case
(634,700). While these cycles are not "architectural", many
implementations will incur these penalties.

The ARCHITECTURAL advantage that the annulling feature confers on SPARC
probably needs more explanation. As the MIPS numbers demonstrate, it is
difficult to fill branch delay slots. SPARC uses standard delayed
branches until it cannot fill branch delay slots. It then uses
annulling branches and fills almost all the remaining branch delay
slots. Annulling branches that are taken incur no penalty and represent
a performance win for SPARC that MIPS cannot realize.

Minimizing the number of load interlock cycles and predicting
conditional branches is a function of compiler technology. Judging from
a comparison with the MIPS number, the load interlock cost could be
around 1,000,000 cycles.
The number of annulled instructions that incur a penalty is reduced
with reasonable branch prediction. Several papers have shown that
static branch prediction can get to 85% for C programs. Currently the
Sun compiler gets 60% correct prediction for this benchmark. 85%
prediction would reduce the untaken annulled branch cycles lost to
263,306.

The bottom line about NOPs is that SPARC is better due to the annulling
ARCHITECTURE feature.

3) SPARC has more sethi instructions (566,506 more). Most of these are
due to the way addresses to global data are generated by the compiler.
An optimization that MIPS employs would eliminate these instructions.
SPARC once performed the optimization (during early development) but we
decided to keep the old a.out format and the old linker and so
postponed the benefit. The SPARC ABI will allow us to remedy this
situation.

4) The category that has the biggest discrepancy against SPARC is
computational (1,016,394). Some of this is probably due to the need to
set condition codes, an ARCHITECTURAL feature of SPARC, but it is not
straightforward to analyze.

5) There are other significant ARCHITECTURAL differences between MIPS
and SPARC that either are not represented in this benchmark or cannot
be isolated with the data. I include this list for completeness.

a) SPARC has a register + register addressing mode for loads and
   stores that MIPS lacks.

b) MIPS has integer multiply and divide instructions that SPARC lacks
   in current implementations.

c) SPARC has load and store double operations (integer and floating
   point). MIPS has no equivalent instructions.

d) MIPS has instructions to move data directly between the integer
   registers and the floating point registers. SPARC has no equivalent
   instructions.

In summary, for this benchmark, the ARCHITECTURAL benefits of register
windows and annulling more than balance the ARCHITECTURAL losses in
computational.
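The load/store figures quoted in observations 1) and 2) follow directly
from the instruction counts in the table. Here is a minimal sketch of
that arithmetic in Python. It is an illustration only: the dictionary
layout and the function names load_store and load_store_share are
assumptions, not anything produced by SPIXSTATS or PIXIE.

```python
# Counts copied from the instruction breakdown table.
# The dictionary keys and function names are illustrative assumptions.
SPARC = {"instructions": 16_313_907, "nops": 231_606,
         "loads": 3_242_293, "stores": 1_175_530}
MIPS = {"instructions": 18_635_185, "nops": 2_594_229,
        "loads": 3_928_710, "stores": 2_037_266}


def load_store(machine):
    """Combined dynamic count of loads and stores."""
    return machine["loads"] + machine["stores"]


def load_store_share(machine, exclude_nops=False):
    """Loads plus stores as a percentage of instructions executed.

    With exclude_nops=True the NOPs are removed from the denominator,
    which is the adjustment described in observation 2).
    """
    executed = machine["instructions"]
    if exclude_nops:
        executed -= machine["nops"]
    return 100.0 * load_store(machine) / executed


if __name__ == "__main__":
    extra = load_store(MIPS) - load_store(SPARC)
    print(f"extra MIPS loads+stores: {extra:,}")
    print(f"relative to SPARC: {100.0 * extra / load_store(SPARC):.0f}% more")
    print(f"MIPS share with NOPs: {load_store_share(MIPS):.0f}%")
    print(f"MIPS share without NOPs: {load_store_share(MIPS, True):.0f}%")
    print(f"SPARC share: {load_store_share(SPARC):.0f}%")
```

Removing the NOPs from the denominator is what moves the MIPS share
from 32% to 37%; the absolute number of loads and stores does not
change.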
The relatively simple enhancements of sethi elimination, branch
prediction and load interlock removal can buy more than 1,000,000
instructions for SPARC. From random inspection of code sequences the
current SPARC compiler appears to produce redundant code, so some
improvement can be expected in this area as well.

For many observers the interesting fact is that for this benchmark, the
MIPS compiler is not significantly better than the current SPARC
compiler. Considering the bad press, I will admit I was surprised by
this myself. Being a SPARC advocate I would claim that SPARC is
ARCHITECTURALLY fundamentally better, but the degree of difference is
probably in the noise in the broader scheme of things.

IMPLEMENTATION ANALYSIS.

This is mainly for historical perspective and to present a complete
picture.

________________________________________________________________________
User machine cycles comparison.
________________________________________________________________________
                                               Sun-4/280     MIPS M1000
instructions                                  16,313,907     18,635,185
loads (extra cycle)                            3,242,293
stores (extra cycles)                          2,351,060
load interlock (")                             1,474,619
untaken branch (")                             1,179,319
annulled cycles (")                              634,700
jmp (")                                          326,578
mult/div (")                                          na        363,987
basic block interlock?                                na         51,983
------------------------------------------------------------------------
total raw cycles                              25,522,476     18,999,172
cache miss cycles                              4,427,524*    14,000,828*
------------------------------------------------------------------------
total machine cycles                          29,950,000     33,000,000
------------------------------------------------------------------------
CPI                                                 1.84           1.77
CPUI (Cycles per Useful Instruction)**              1.86           2.02
MIPS                                                9.06           9.23
MUIPS (Millions of Useful Instructions/Sec)         8.95           8.06

________________________________________________________________________
Rough Memory System Analysis
________________________________________________________________________
                      Sun-4/280           MIPS M1000
memory references     20,731,730          24,601,161 (+3,869,431 +18.6%)
average penalty       10                  10??*
misses/other          442,752 (2.1%)*     1,400,083 (5.7%)??*

Benchmark Data        Sun-4/280           MIPS M1000
clock                 16.67MHz            15MHz
user time             1.797secs           2.2secs
system time           .285secs            .3secs

*  These are rough numbers working backwards from the time necessary to
   run the program and the clock frequency. The MIPS cache is write
   through and incurs significant penalties in write stalls. I cannot
   distinguish the magnitude of this effect here.

** Useful Instructions are all instructions not including NOPs.

________________________________________________________________________
OPERATING SYSTEM OVERHEAD
________________________________________________________________________

The time spent in the operating system is broadly comparable on both
machines. Detailed analysis of how this breaks down is difficult.

In current SPARC implementations window overflow/underflow is
accomplished with trap handlers. MIPS currently handles TLB misses with
trap handlers. The number of overflows for the Sun-4/280 (with 7
windows) was 4,439 and underflows 4,438, for a total of 8,877 traps.
For SPARC the number of overflows and underflows is dependent on the
number of register windows in an implementation (e.g. a Cypress based
design with 8 windows would have 2569 overflows and 2568 underflows for
this program).
Each overflow or underflow trap performs either eight store doubles or
eight load doubles. This is equivalent to 71,024 extra loads and 71,024
extra stores for the 4/280, a tiny fraction (3%) of the total loads and
stores. If the TLB miss rate for MIPS was .1% (an optimistic
assumption) this would have resulted in 24,601 traps. As an
approximation, both machines' trap overheads appear comparable for this
benchmark.

Most of the system overhead is not in these trap handlers. For the
4/280 the overflow/underflow trap handlers take about 545,932 cycles
out of the approximately 5,000,000 cycles of system time.

I should clarify why I am treating underflow and overflow penalties in
this section and not under architecture. As the numbers above show,
nearly all aspects of underflow/overflow penalties are IMPLEMENTATION
specific. The number of register windows and the details of hardware or
trap handler organization, all of which are determined by hardware or
kernel implementations, are what account for this overhead.

______________________________________________________________________________
GENERAL IMPLEMENTATION COMMENTS
------------------------------------------------------------------------------

These numbers represent significant differences in the IMPLEMENTATION
philosophies at Sun and at MIPS. The central goal at MIPS appears to
have been to achieve a single cycle per instruction, even at the cost
of cycle time and complexity. Clearly that was not a central goal at
Sun. Most of the raw CPI differences are due to the multi-cycle loads
and stores. This is due to the single 32-bit bus vs MIPS's multiplexed
32-bit bus. The single 32-bit bus was chosen for system simplicity. It
also facilitates designing low cost systems and Multi-Processor
systems.

Our goals were dominated by cycle time and system simplicity.
Performance on large programs was our design metric. The first SPARC
implementation achieved a faster cycle time than the best of MIPS's
first implementations, despite inferior technology.
The Cypress SPARC implementation is achieving a better cycle time than
the latest MIPS implementation from Performance Semi. (33MHz vs 25MHz).
This is not coincidental. Fujitsu has announced a new SPARC part for
next year that will have multiple 64-bit busses, which will demonstrate
a good CPI and bury the myth that SPARC is tied to multi-cycle loads
and stores.

MIPS generates more memory references (18.6%, see above) than SPARC,
and the first MIPS implementations compounded this with poor
cache/memory system design. As a result, large integer programs perform
better overall on the SPARC implementation, which has a better
cache/memory system. The MIPS performance brief has concentrated on
relatively small integer programs that fit in the cache and so benefit
well from the single cycle loads and stores. This overstates the
integer performance for large programs, which are after all what people
buy fast machines to run. MIPS implicitly acknowledges this by calling
the M1000 a 10 MIPS box despite the fact that all the published data in
the MIPS performance brief would say integer performance is greater
than 12 MIPS.

The performance brief also leans heavily on floating point performance,
where the first SPARC implementations are clearly inferior to the first
MIPS implementations. This weakness was redressed by the parts
announced by Cypress some time ago.

As the data demonstrates, for a real and significant program, the
Sun-4/280 is comparable to the M1000. The data also shows that for this
program the SPARC instruction set and compiler duo are comparable to
the MIPS instruction set and compiler duo.

Ed Kelly

The opinions here are my own and do not necessarily represent those of
Sun Microsystems.