Path: utzoo!attcan!uunet!seismo!sundc!pitstop!sun!aisling!edkelly
From: edkelly%aisling@Sun.COM (Ed Kelly)
Newsgroups: comp.arch
Subject: SPARC vs. MIPS on gcc
Keywords: SPARC, MIPS
Message-ID: <82150@sun.uucp>
Date: 17 Dec 88 01:31:48 GMT
Sender: news@sun.uucp
Reply-To: edkelly@sun.UUCP (Ed Kelly)
Organization: Sun Microsystems, Mountain View
Lines: 306


            A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.

For the comparison we chose a large portable C program (the GNU C Compiler rev
1.24) and compiled the identical source on a Sun-4/280 with the SPARC compiler
to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to 
produce a MIPS binary.
     Then, using the same data (the file gcc.c), we ran the benchmark on both
machines and gathered the dynamic trace statistics provided by SPIXSTATS and
PIXIE, the respective statistics gathering programs for SPARC and MIPS. We
also measured the user and system time on both machines.
     The compiler optimization level was set at -O2 for MIPS (the highest
level at which the program would compile) and at -O4 for SPARC. Both compilers
were the standard production versions as of September 1988. MIPS -O2 and
SPARC -O4 are comparable levels of optimization. -O3 was the highest MIPS
optimization level available, but data on other C programs show that -O3
gains less than 2% on average over -O2, so we feel the comparison is valid.
     The following is divided into two sections. The first section covers a
SPARC vs. MIPS instruction set architecture comparison and the second is an
implementation comparison of the Sun-4/280 vs. the M/1000. The architecture
comparison counts INSTRUCTIONS and is useful for comparing instruction sets
and compiler efficiency. These counts will not vary across implementations as
long as the compilers are held constant. If you are interested in architecture
and wish to avoid the confusion of implementation details, these are the
numbers of most interest. The implementation comparison counts CYCLES and
includes effects such as multi-cycle loads and cache misses.


__________________________________________________________________
  INSTRUCTION SET/REGISTER ARCHITECTURE AND COMPILER COMPARISON
__________________________________________________________________

                        SPARC               MIPS                MIPS-SPARC

Total Instructions      16,313,907          18,635,185          +2,321,278
--------------------------------------------------------------------------
                     Detailed Breakdown
--------------------------------------------------------------------------
Branch nops             109,079             1,170,306
Load nops               na                  1,113,019
Jump nops               102,417             211,409
other nops              20,110              99,495
annulled delay slots    (634,700)           na
load interlock cycles   (1,474,619)         na
--------------------------------------------------------------------------
nops sub-total          231,606 (1.4%)      2,594,229 (14%)     +2,362,623

loads                   3,242,293 (19.9%)   3,928,710 (21%)     +686,417

stores                  1,175,530 (7.2%)    2,037,266 (10.9%)   +861,736

conditional branches    2,699,885           2,559,648
unconditional branches  225,739             190,456
jumps                   326,578             498,865
calls                   214,662             213,118
--------------------------------------------------------------------------
jmp/branch sub-total    3,466,864 (21%)     3,462,087 (18.5%)   -4,777


shift                   716,666             890,281
logical set cc          850,121
logical                 1,396,645           1,335,473
arithmetic set cc       1,914,842
arithmetic              1,853,659           3,241,789
set                     na                  666,309
save/restore            337,820             na
others                  84,094              41,838
--------------------------------------------------------------------------
computational sub-total 7,192,084 (44%)     6,175,690 (33%)     -1,016,394


sethi/lui               1,003,713 (6.15%)   437,207 (2.3%)      -566,506
--------------------------------------------------------------------------

Some notes on the categories. 

	MIPS "set" could be categorized as arithmetic or arithmetic set cc.

	SPARC "save/restore" could be categorized as arithmetic. They
	adjust the stack pointer and increment/decrement the window pointer.
	The equivalent MIPS operation is to adjust the stack pointer.

	The SPARC nops listed as "other" are mostly associated with 
	calls.
 
	The "others" category is mostly multiply related.
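
As a cross-check, the nop sub-totals and the percentage columns can be
recomputed from the raw counts in the table. The following is a minimal Python
sketch. The names (TOTAL, NOPS, nop_subtotal, share) are chosen here for
illustration only and are not part of SPIXSTATS or PIXIE.

```python
# Counts copied from the instruction breakdown table above.
TOTAL = {"SPARC": 16_313_907, "MIPS": 18_635_185}
NOPS = {
    "SPARC": {"branch": 109_079, "jump": 102_417, "other": 20_110},
    "MIPS": {"branch": 1_170_306, "load": 1_113_019,
             "jump": 211_409, "other": 99_495},
}
LOADS = {"SPARC": 3_242_293, "MIPS": 3_928_710}
STORES = {"SPARC": 1_175_530, "MIPS": 2_037_266}


def nop_subtotal(arch):
    """Sum the nop rows for one architecture."""
    return sum(NOPS[arch].values())


def share(count, arch):
    """Express a count as a percentage of all instructions executed."""
    return 100.0 * count / TOTAL[arch]


if __name__ == "__main__":
    for arch in ("SPARC", "MIPS"):
        nops = nop_subtotal(arch)
        print(f"{arch}: nops {nops:,} ({share(nops, arch):.1f}%), "
              f"loads {share(LOADS[arch], arch):.1f}%, "
              f"stores {share(STORES[arch], arch):.1f}%")
```

The annulled delay slot and load interlock rows are left out on purpose: they
are shown in parentheses in the table because they are cycles, not
instructions, and are not part of the nop sub-total.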

As will surprise most observers, SPARC executes fewer instructions than MIPS.

Some specific observations.

1) SPARC has many fewer loads and stores (1,548,153 fewer), which points to the
significant ARCHITECTURAL advantage of register windows.  Stated another way,
for this case MIPS has 35% more loads and stores than SPARC. This benchmark
contains more loads and stores than our "average" case of 15% loads and 6%
stores, so the benefits of register windows may actually be understated here.

2) There are lots of NOPs in MIPS code. This is an ARCHITECTURAL feature.
NOPs are not benign. As well as the direct cycles lost, NOPs hurt code
density and increase instruction cache miss penalties (due to more memory
accesses and a greater probability of a miss).
     A subtle point about the NOPs is that they distort statistics presented
as percentages. MIPS's combined load/store percentage is 32% for this
benchmark. If there were no NOPs the percentage would be 37% vs. SPARC's 27%.
     Current SPARC implementations, however, incur a clock cycle penalty for
some of the cases where MIPS has to insert NOPs, so counting all NOPs against
MIPS overstates the situation. These cases are the load-use interlock
(1,474,619 cycles) and the untaken annulled branch (634,700 cycles). While
these cycles are not "architectural", many implementations will incur these
penalties.
     The ARCHITECTURAL advantage that the annulling feature confers on SPARC 
probably needs more explanation.  As the MIPS numbers demonstrate, it is 
difficult to fill branch delay slots. SPARC uses standard delayed branches 
until it cannot fill branch delay slots. It then uses annulling branches and 
fills almost all the remaining branch delay slots. Annulling branches that are 
taken incur no penalty and represent a performance win for SPARC that MIPS 
cannot realize. 
     Minimizing the number of load interlock cycles and predicting
conditional branches are functions of compiler technology. Judging from the
MIPS load nop count, the load interlock cost could be brought down to around
1,000,000 cycles. The number of annulled instructions that incur a penalty is
reduced with reasonable branch prediction. Several papers have shown that
static branch prediction can reach 85% accuracy for C programs. Currently the
Sun compiler gets 60% correct prediction for this benchmark. 85% prediction
would reduce the untaken annulled branch cycles lost to 263,306.
     The bottom line about NOPs is that SPARC is better due to the
ARCHITECTURAL feature of annulling.

3) SPARC has more sethi instructions (566,506 more). Most of these are due to
the way addresses of global data are generated by the compiler. An
optimization that MIPS employs would eliminate these instructions. SPARC once
performed the optimization (during early development), but we decided to keep
the old a.out format and the old linker and so postponed the benefit. The
SPARC ABI will allow us to remedy this situation.

4)  The category that has the biggest discrepancy against SPARC is computational
(1,016,394). Some of this is probably due to the need to set condition codes,
an ARCHITECTURAL feature of SPARC, but it is not straightforward to analyze. 

5) There are other significant ARCHITECTURAL differences between MIPS and
SPARC that either are not represented in this benchmark or cannot be
isolated with the data. I include this list for completeness.

 a) SPARC has a register + register addressing mode for loads and stores that
    MIPS lacks.

 b) MIPS has integer multiply and divide instructions that SPARC lacks
    in current implementations.

 c) SPARC has load and store double operations (integer and floating point).
    MIPS has no equivalent instructions.

 d) MIPS has instructions to move data directly between the integer registers
    and the floating point registers. SPARC has no equivalent instructions.
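
The load/store arithmetic in observations 1 and 2 can be reproduced from the
table. The sketch below is only an illustration in Python; the dictionary
layout and the function names (mem_ops, mem_share) are assumptions made here
and do not come from either tracing tool.

```python
# Totals, nop sub-totals, loads and stores copied from the table above.
SPARC = {"total": 16_313_907, "nops": 231_606,
         "loads": 3_242_293, "stores": 1_175_530}
MIPS = {"total": 18_635_185, "nops": 2_594_229,
        "loads": 3_928_710, "stores": 2_037_266}


def mem_ops(m):
    """Loads plus stores executed."""
    return m["loads"] + m["stores"]


def mem_share(m, exclude_nops=False):
    """Loads+stores as a percentage of instructions, optionally ignoring nops."""
    denom = m["total"] - (m["nops"] if exclude_nops else 0)
    return 100.0 * mem_ops(m) / denom


if __name__ == "__main__":
    extra = mem_ops(MIPS) - mem_ops(SPARC)
    print(f"extra MIPS loads+stores: {extra:,} "
          f"({100.0 * extra / mem_ops(SPARC):.0f}% more than SPARC)")
    print(f"MIPS load/store share:   {mem_share(MIPS):.0f}% "
          f"(nops removed: {mem_share(MIPS, exclude_nops=True):.0f}%)")
    print(f"SPARC load/store share:  {mem_share(SPARC):.0f}%")
```

Dividing by the nop-free instruction count is what moves the MIPS figure from
32% to 37%; the SPARC figure barely moves because SPARC executes so few nops.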


In summary, for this benchmark, the ARCHITECTURAL benefits of register windows
and annulling more than balance the ARCHITECTURAL losses in computational
instructions. The relatively simple enhancements of sethi elimination, branch
prediction and load interlock removal can buy more than 1,000,000 instructions
for SPARC. From random inspection of code sequences, the current SPARC compiler
appears to produce redundant code, so some improvement can be expected in this
area as well.
     For many observers the interesting fact is that, for this benchmark, the
MIPS compiler is not significantly better than the current SPARC compiler.
Considering the bad press, I will admit I was surprised by this myself.
     Being a SPARC advocate, I would claim that SPARC is fundamentally better
ARCHITECTURALLY, but the degree of difference is probably in the noise in
the broader scheme of things.


                        IMPLEMENTATION ANALYSIS.

This is mainly for historical perspective and to present a complete picture.

___________________________________________________________________________
			User machine cycles comparison.
___________________________________________________________________________

                            Sun-4/280               MIPS M1000

instructions                16,313,907              18,635,185
loads (extra cycle)         3,242,293
stores (extra cycles)       2,351,060
load interlock   (")        1,474,619
untaken branch   (")        1,179,319
annulled cycles  (")        634,700
jmp              (")        326,578
mult/div         (")        na                      363,987
basic block interlock?      na                      51,983
-----------------------------------------------------------
total raw cycles            25,522,476              18,999,172

cache miss cycles           4,427,524*              14,000,828*
-----------------------------------------------------------
total machine cycles        29,950,000              33,000,000
-----------------------------------------------------------
CPI                         1.84                    1.77
CPUI (Cycles per Useful
  Instruction)**            1.86                    2.02
MIPS                        9.06                    9.23
MUIPS (Millions of Useful
  Instructions/Sec)         8.95                    8.06


___________________________________________________________________________
  			Rough Memory System Analysis
___________________________________________________________________________
memory references       20,731,730              24,601,161 (+3,869,431, +18.6%)
average penalty         10                      10??*
misses/other            442,752 (2.1%)*         1,400,083 (5.7%)??*


Benchmark Data
                        Sun-4/280               MIPS M1000
clock                   16.67 MHz               15 MHz
user time               1.797 secs              2.2 secs
system time             0.285 secs              0.3 secs

   * These are rough numbers working backwards from the time necessary to run
     the program and the clock frequency. The MIPS cache is write through and
     incurs significant penalties in write stalls. I cannot distinguish the 
     magnitude of this effect here.

   ** Useful Instructions are all instructions not including NOPs.
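
The "working backwards" in the first footnote can be sketched as follows. This
is an illustrative Python sketch, not the procedure actually used: the
function and key names are made up here, and treating memory references as
instruction fetches + loads + stores is an inference that happens to reproduce
the table's totals. The 10-cycle penalty is the table's own rough assumption.

```python
MISS_PENALTY = 10  # cycles per miss, the table's assumed average penalty

# Figures copied from the tables above.
MACHINES = {
    "Sun-4/280": {"instructions": 16_313_907, "loads": 3_242_293,
                  "stores": 1_175_530, "raw_cycles": 25_522_476,
                  "total_cycles": 29_950_000},
    "MIPS M1000": {"instructions": 18_635_185, "loads": 3_928_710,
                   "stores": 2_037_266, "raw_cycles": 18_999_172,
                   "total_cycles": 33_000_000},
}


def cycles_from_time(seconds, clock_mhz):
    """User time multiplied by clock rate gives total machine cycles."""
    return seconds * clock_mhz * 1e6


def analyze(m):
    """Derive CPI and a rough miss rate from total and raw cycle counts."""
    miss_cycles = m["total_cycles"] - m["raw_cycles"]
    mem_refs = m["instructions"] + m["loads"] + m["stores"]
    misses = miss_cycles / MISS_PENALTY
    return {
        "cpi": m["total_cycles"] / m["instructions"],
        "miss_cycles": miss_cycles,
        "mem_refs": mem_refs,
        "miss_rate_pct": 100.0 * misses / mem_refs,
    }


if __name__ == "__main__":
    for name, m in MACHINES.items():
        r = analyze(m)
        print(f"{name}: CPI {r['cpi']:.2f}, miss cycles {r['miss_cycles']:,}, "
              f"memory refs {r['mem_refs']:,}, "
              f"miss rate {r['miss_rate_pct']:.1f}%")
```

Calling cycles_from_time(1.797, 16.67) gives a value within a small fraction
of a percent of the rounded 29,950,000 used in the table. Because the miss
cycles are whatever is left over after the raw cycles are subtracted, they
also absorb effects such as the MIPS write stalls mentioned in the footnote.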


______________________________________________________________________________
                        OPERATING SYSTEM OVERHEAD
______________________________________________________________________________

The time spent in the operating system is broadly comparable on both machines. 
Detailed analysis of how this breaks down is difficult. In current SPARC
implementations window overflow/underflow is accomplished with trap handlers. 
MIPS currently handles TLB misses with trap handlers. 
     The number of overflows for the Sun-4/280 (with 7 windows) was 4,439
and underflows 4,438, for a total of 8,877 traps. For SPARC the number of
overflows and underflows depends on the number of register windows in an
implementation (e.g. a Cypress-based design with 8 windows would have
2,569 overflows and 2,568 underflows for this program). Each overflow performs
eight store doubles and each underflow eight load doubles. This is equivalent
to about 71,024 extra loads and 71,024 extra stores for the 4/280, a tiny
fraction (3%) of the total loads and stores.
 
     If the TLB miss rate for MIPS was 0.1% (an optimistic assumption),
this would have resulted in 24,601 traps. As an approximation, both machines'
trap overheads appear comparable for this benchmark. Most of the system
overhead is not in these trap handlers. For the 4/280 the overflow/underflow
trap handlers take about 545,932 cycles out of the approximately 5,000,000
cycles of system time.
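
The trap arithmetic above can be checked with a few lines of Python. This is
only a sketch under stated assumptions: the constant and function names are
invented here, each double operation is counted as two single-word accesses,
and an overflow is taken to store a window while an underflow loads one.

```python
# Trap counts and cycle figures quoted in the text above.
OVERFLOWS = 4_439
UNDERFLOWS = 4_438
DOUBLES_PER_TRAP = 8          # eight load or store doubles per trap
WORDS_PER_DOUBLE = 2          # assumption: one double = two word accesses
SPARC_LOADS_STORES = 3_242_293 + 1_175_530
HANDLER_CYCLES = 545_932
MIPS_MEM_REFS = 24_601_161
TLB_MISS_RATE = 0.001         # the 0.1% assumed in the text


def window_trap_summary():
    """Memory traffic and cost per trap for window overflow/underflow."""
    traps = OVERFLOWS + UNDERFLOWS
    extra_stores = OVERFLOWS * DOUBLES_PER_TRAP * WORDS_PER_DOUBLE
    extra_loads = UNDERFLOWS * DOUBLES_PER_TRAP * WORDS_PER_DOUBLE
    return {
        "traps": traps,
        "extra_stores": extra_stores,
        "extra_loads": extra_loads,
        "pct_of_loads_stores":
            100.0 * (extra_loads + extra_stores) / SPARC_LOADS_STORES,
        "cycles_per_trap": HANDLER_CYCLES / traps,
    }


def tlb_traps():
    """Estimated MIPS TLB miss traps at the assumed miss rate."""
    return round(MIPS_MEM_REFS * TLB_MISS_RATE)


if __name__ == "__main__":
    s = window_trap_summary()
    print(f"window traps: {s['traps']:,}, extra stores {s['extra_stores']:,}, "
          f"extra loads {s['extra_loads']:,} "
          f"({s['pct_of_loads_stores']:.1f}% of user loads+stores)")
    print(f"handler cost: about {s['cycles_per_trap']:.1f} cycles per trap")
    print(f"estimated MIPS TLB traps: {tlb_traps():,}")
```

The load figure comes out 16 words below the store figure because there was
one fewer underflow than overflow; the text rounds both to 71,024.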

     I should clarify why I am treating underflow and overflow penalties
in this section and not under architecture. As the numbers above show, nearly 
all aspects of underflow/overflow penalties are IMPLEMENTATION specific. The
number of register windows and details of hardware or trap handler organization,
all of which are determined by hardware or kernel implementations, are what
account for this overhead.

______________________________________________________________________________
			GENERAL IMPLEMENTATION COMMENTS
______________________________________________________________________________

These numbers represent significant differences in the IMPLEMENTATION
philosophies at Sun and at MIPS. The central goal at MIPS appears to have been
to achieve a single cycle per instruction, even at the cost of cycle time and
complexity. Clearly that was not a central goal at Sun. 
     Most of the raw CPI differences are due to the multi-cycle loads and
stores. This is due to our single 32-bit bus vs. MIPS's multiplexed 32-bit
bus. The single 32-bit bus was chosen for system simplicity. It also
facilitates designing low cost systems and multiprocessor systems.
     Our goals were dominated by cycle time and system simplicity. Performance
on large programs was our design metric. The first SPARC implementation
achieved a faster cycle time than the best of MIPS's first implementations,
despite inferior technology. The Cypress SPARC implementation is achieving a
better cycle time than the latest MIPS implementation from Performance Semi.
(33MHz vs. 25MHz). This is not coincidental. Fujitsu has announced a new SPARC
part for next year that will have multiple 64-bit busses, which will
demonstrate a good CPI and bury the myth that SPARC is tied to multi-cycle
loads and stores.
     MIPS generates more memory references than SPARC (18.6% more, see above),
and the first implementations compounded this with poor cache/memory system
design. As a result, large integer programs perform better overall on the
SPARC implementation, which has a better cache/memory system.
     The MIPS performance brief has concentrated on relatively small
integer programs that fit in the cache and so benefit from the single cycle
loads and stores. This overstates the integer performance for large programs,
which are after all what people buy fast machines to run. MIPS implicitly
acknowledges this by calling the M1000 a 10 MIPS box despite the fact that all
the published data in the MIPS performance brief would say integer performance
is greater than 12 MIPS. The performance brief also leans heavily on
floating point performance, where the first SPARC implementations are
clearly inferior to the first MIPS implementations. This weakness was
redressed by the parts announced by Cypress some time ago.

    As the data demonstrates, for a real and significant program, the Sun-4/280
is comparable to the M1000. The data also shows that for this program the 
SPARC instruction set and compiler duo are comparable to the MIPS instruction 
set and compiler duo.

Ed Kelly

The opinions here are my own and do not necessarily represent those of
Sun Microsystems.