Path: utzoo!attcan!uunet!lll-winken!lll-lcc!ames!vsi1!daver!mips!earl@wright.mips.com
From: earl@wright.mips.com (Earl Killian)
Newsgroups: comp.arch
Subject: Re: SPARC vs. MIPS on gcc
Keywords: SPARC, MIPS
Message-ID: <10574@wright.mips.COM>
Date: 3 Jan 89 17:06:28 GMT
References: <82150@sun.uucp> <697@hscfvax.harvard.edu> <677@helios.toronto.edu> <3790@druhi.ATT.COM> <10436@winchester.mips.COM>
Sender: earl@mips.COM
Organization: MIPS Computer Systems, Sunnyvale CA
Lines: 214

Ed Kelly of Sun studied gcc compiling gcc.c on MIPS and SPARC, and
posted some statistics together with his analysis and conclusions.  I
decided to take a look myself (also, it's a likely SPEC benchmark, so
understanding it will be useful).  At first I was unable to duplicate
Kelly's statistics.

gcc compiled on MIPS with cc -O3 and ran without a hitch, whereas Kelly
said -O3 didn't work (-O4 also works if you fix a trivial bug in the
gcc source).  Subsequently we were told that the -O3 problem at Sun was
a lack of space in /tmp on their machine, not a compiler bug.

With -O3 I get 17.40M instructions.  At -O2, I get 17.82M instructions
instead of his 18.64M, so there was a big difference to explain.

The major difference between -O2 and -O3 is inter-procedural register
allocation.  A minor difference is that -O2 by default declines to
optimize "big" procedures (> 500 basic blocks) to save on compilation
time during program development.  It warns you by saying
	uopt: Warning: expand_expr: this procedure not optimized
	      because it exceeds size threshold; to optimize this
	      procedure, use -Olimit option with value >= 656.
For benchmarking, I go back and add a -Olimit to the Makefile and
recompile, just as the warning suggests.  If I leave off the
-Olimit then several procedures remain unoptimized and the result
is 18.25M instructions.  That is closer to Kelly's result, but still
not there.  (Note that two of the unoptimized procedures are yyparse
and yylex, which are the 2nd and 3rd heaviest contributors to CPU
cycles...)

Kelly was running this benchmark on a System V M/1000 as opposed to a
BSD M/1000 (MIPS sells both flavors of Unix).  When I tried it on
System V I got link errors for BSD-only routines such as bcopy and
bzero, which I solved by adding -lbsd to the command line.  My guess
is that Kelly didn't know about -lbsd and chose to use
straightforward byte-at-a-time bcopy/bzero substitutes.  When I try
that I get 18.68M instructions, which is quite close to his result.

In summary:
18.68M	-O2, no opt of yyparse, yylex etc., no use of library bcopy/bzero
18.64M	posted number
18.25M	-O2, no opt of yyparse, yylex, etc.
17.82M	-O2, optimize yyparse, yylex, etc.
17.40M	-O3
(All results use the MIPS 1.31 compilers, which were released in mid-1988.)
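
To put the table above on a common scale, here is a minimal Python
sketch that expresses each count as a percentage over the -O3 figure.
The counts are copied from the summary; the dictionary labels and the
helper name percent_over are illustrative only and come from nowhere
in the compilers or in Kelly's posting.

```python
# Instruction counts in millions, copied from the summary above.
# The labels are shorthand invented for this sketch.
counts = {
    "-O2, big procedures unoptimized, byte-at-a-time bcopy/bzero": 18.68,
    "posted number": 18.64,
    "-O2, big procedures unoptimized": 18.25,
    "-O2 with -Olimit": 17.82,
    "-O3": 17.40,
}


def percent_over(count, baseline):
    """Return by how many percent `count` exceeds `baseline`."""
    return 100.0 * (count - baseline) / baseline


baseline = counts["-O3"]
for label, count in counts.items():
    print(f"{count:6.2f}M  {percent_over(count, baseline):+5.1f}%  {label}")
```

Picking -O3 as the baseline is an arbitrary choice; any row would do,
since only the relative spread between configurations matters here.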

The point of this was to show that Kelly's analysis was built on
questionable statistics.  But even with his statistics as a basis,
some of his conclusions are unwarranted.

As many people pointed out, gcc is only one data point, and it is
unreasonable to conclude anything from a single data point.  There
might be something anomalous in that one case, for example.

One thing I learned in porting gcc is that the MIPS compiler generates
poor code for a C construct that gcc uses heavily (a bit-field enum
that is both aligned and 16 bits in length).  Oh well, every compiler
has some simple things it doesn't bother to special case.  This will
be fixed in a future compiler release.  With that compiler the gcc
instruction count on Kelly's input is 16.47M instructions at -O3
(about 6% fewer instructions).  It is exactly this sort of sensitivity
to small details that makes single data point conclusions unreliable.

It also turns out that 6% of the instruction cycles are spent in
printf etc.  I don't know whether the SPARC printf has been heavily
tuned or not; ours has not.  It is fair to include the cost of this as
a system test: that's what the user sees.  However, it is hard to draw
conclusions about Instruction Set Architecture (ISA) + Compilers,
where one is concerned about a % here or there, when noticeable parts
of the code are from libraries.

With those caveats in mind, let's look at some of Kelly's remarks:

   "As will surprise most observers, SPARC executes fewer instructions
   than MIPS."

This doesn't surprise me when I look closer and see how the
instruction counts differ.  After all, the RISC vs. CISC wars were
begun with the premise that instructions were only one term in the
performance equation.  Total performance is what matters.

As several people pointed out on the net, the difference in
instruction counts is primarily attributable to MIPS using a NOP
instruction instead of a hardware interlock for load instructions
(shifting responsibility from hardware to software).  With
interlocking, the load NOPs would be replaced by a single-cycle stall,
so the load NOPs have no direct performance impact (an indirect effect
is that the increase in code size affects i-cache miss rates).  To
compensate for the difference in interlocking approach (hardware vs.
software), you can either subtract load NOPs (0.91M) from the MIPS
counts or add SPARC interlocks (1.47M) to the SPARC counts.  With our
1.31 compilers, that makes the difference +-1% for adjusted
instruction count.  (With the compiler that optimizes aligned 16-bit
bit-fields to halfwords, it is 5 to 8% in favor of MIPS.)
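
The two adjustments described above can be sketched in Python as
follows.  This is only an illustration: it assumes the 17.40M -O3
figure is the MIPS count being adjusted, and the SPARC instruction
count, which is not given in this posting, is left as an argument to
be taken from Kelly's article.  All function names are hypothetical.

```python
# All counts are in millions of instructions.
MIPS_INSTRUCTIONS = 17.40  # assumption: the -O3 count is the one being adjusted
MIPS_LOAD_NOPS = 0.91      # load NOPs, from the text
SPARC_INTERLOCKS = 1.47    # load interlock stalls, from the text


def mips_without_load_nops(instructions=MIPS_INSTRUCTIONS, nops=MIPS_LOAD_NOPS):
    """Method 1: drop the software NOPs from the MIPS count."""
    return instructions - nops


def sparc_with_interlocks(sparc_instructions, interlocks=SPARC_INTERLOCKS):
    """Method 2: charge the hardware stalls to the SPARC count."""
    return sparc_instructions + interlocks


def percent_difference(a, b):
    """Percent by which `a` exceeds `b` (negative if `a` is smaller)."""
    return 100.0 * (a - b) / b


print(f"MIPS count without load NOPs: {mips_without_load_nops():.2f}M")
```

Given a SPARC count s, method 1 compares mips_without_load_nops()
against s, and method 2 compares MIPS_INSTRUCTIONS against
sparc_with_interlocks(s); percent_difference() gives the gap either way.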

But again, instruction counts aren't a good basis for comparison.  I
don't think you can compare ISAs without looking at implementations.
For example, MIPS has a divide instruction and SPARC has none.  Should
we add in our divide interlocks to be fair?  But a hypothetical MIPS
machine could have an 8-cycle divide, so maybe we ought to use 8, not
35, in ISA comparison?  How can this work?

In contrast, comparing cycles or time is more meaningful.  Kelly gives
25.52M as the raw cpu cycle count.  The corresponding MIPS number
(1.31 compilers) is 17.74M.  The large difference is of course due to
the Fujitsu SPARC chip using one extra cycle on loads, 2 extra cycles
on stores, and one extra cycle on untaken branches.

To go beyond cpu performance we need to pick a memory system.  This is
probably a good place to point out that the M/1000 Kelly used is a
lower performance machine than anything we now sell; it has been
essentially obsoleted by the 16.7MHz M/120 (like the M/1000, based on
the R2000) and the 25MHz M/2000 (based on the R3000), both of which
are in production and shipping.

Adding in cache miss cycles, Kelly gives a total of 29.95M cycles for
the Sun 4/280.  For the MIPS M/120 I get 24.19M (27.80M for the
M/1000).  Since the cycle time is the same for both the 4/280 and the
M/120, the cycle counts are directly related to time.
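
As a rough check on these figures, the Python sketch below splits each
total into raw CPU cycles and memory-system cycles and converts the
totals to seconds at 16.7MHz.  It assumes the 17.74M raw MIPS cycle
count applies unchanged to the M/120, since raw CPU cycles should not
depend on the memory system; the variable and function names are made
up for the illustration.

```python
CLOCK_MHZ = 16.7  # both the Sun 4/280 and the MIPS M/120, per the text

# (raw CPU cycles, total cycles including cache misses), in millions
machines = {
    "Sun 4/280": (25.52, 29.95),
    "MIPS M/120": (17.74, 24.19),  # assumption: raw count carries over to the M/120
}


def miss_cycles(raw, total):
    """Cycles attributed to the memory system (millions)."""
    return total - raw


def seconds(total_mcycles, clock_mhz=CLOCK_MHZ):
    """Millions of cycles divided by MHz gives seconds."""
    return total_mcycles / clock_mhz


for name, (raw, total) in machines.items():
    print(f"{name}: raw {raw:.2f}M, memory {miss_cycles(raw, total):.2f}M, "
          f"total {total:.2f}M, {seconds(total):.2f}s")

sun_total = machines["Sun 4/280"][1]
mips_total = machines["MIPS M/120"][1]
print(f"Sun/MIPS total cycle ratio: {sun_total / mips_total:.2f}")
```

Because both machines run at the same clock, the ratio of total cycles
is also the ratio of run times; the conversion to seconds only makes
the scale concrete.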

I don't think there's much to squabble about here.  Time is time.  All
the trade-offs have been reduced to a single number.  Kelly might
object that a hypothetical SPARC implementation might avoid the extra
load/store/branch cycles.  Such an implementation is said to be in
progress.  When it's appropriate, why not use it for comparison with
the corresponding MIPS system?

   "For many observers the interesting fact is that for this
   benchmark, the MIPS compiler is not significantly better than the
   current SPARC compiler.  Considering the bad press, I will admit I was
   surprised by this myself."

This statement was unsubstantiated; it is not obvious to me how to
compare compilers based on instruction statistics from different
architectures, especially on only one benchmark.  The few things that
do come to mind suggest that the MIPS compiler is doing a better job,
but given the importance of library code in this benchmark, the whole
subject is on thin ice.  Perhaps Kelly can elaborate?

   "Being a SPARC advocate I would claim that SPARC is
   ARCHITECTURALLY fundamentally better, but the degree of difference is
   probably in the noise in the broader scheme of things."

(-: Gee, being a MIPS advocate, and given the corrected numbers,
should I claim that the MIPS ISA is 5-8% fundamentally better?
:-)

Kelly moves on to discuss the architecture of the entire system, not
just the ISA.  I have some quibbles with his methodology (e.g.
inferring anything from Unix runtimes on the order of 1-2 seconds,
where the error per measurement is probably 10% or more), but I really
have to restrict myself to addressing a few of his off-hand remarks
(this posting is already too long).

   "These numbers represent significant differences in the IMPLEMENTATION
   philosophies at Sun and at MIPS. The central goal at MIPS appears to
   have been to achieve a single cycle per instruction, even at the cost
   of cycle time and complexity. Clearly that was not a central goal at
   Sun."

Certainly single-cycle execution was one of several MIPS goals, but I
would not say it was at the expense of cycle time or complexity at all.
The most significant pressure on cycle time in the R2000 is due to
physical instead of virtual caches, not single-cycle execution.
Virtual caches simplify the CPU at the expense of multi-programming
performance and multi-processing implementation complexity.

   "Our goals were dominated by cycle time and system simplicity.
   Performance on large programs was our design metric.  The first
   SPARC implementation achieved a faster cycle time than the best of
   MIP's first implementations, despite inferior technology."

This is not true.  Both the Fujitsu SPARC and the R2000 are 16.7MHz
chips.  The M/1000 system, based on the R2000, was 15MHz instead of
16.7MHz because it used memory boards from the M/500 generation system
(you could upgrade with a cpu board replacement), and those memory
boards are good to 15MHz.  (The M/500 was introduced 18 months before
the Sun 4/260.)  Both MIPS and its customers ship systems based on the
R2000 at 16.7MHz (the M/1000 just isn't one of them).

Is the Fujitsu SPARC implemented in an inferior technology to the
R2000?  That's hard to call.  The Fujitsu SPARC is implemented in what
is, I think, a 1.5 micron CMOS gate array technology whereas the R2000
is implemented in 2.0 micron custom CMOS technology.  I'm not sure how
to compare these particular apples and oranges.

   "The MIPS performance brief has concentrated on relatively
   small integer programs that fit in the cache and so benefit well
   from the single cycle loads and stores."

The MIPS performance brief concentrates on large programs.  It is the
case that the large programs are floating point; large public domain
floating point programs are easier to find than large public domain
integer programs.  The UNIX commands listed in the Brief are at least
reasonably-sized real programs, not toys, and they're what a lot of
people use.  What about the Sun performance brief?  It relies on the
dhrystone and stanford benchmarks, which are much smaller than the
MIPS Unix suite.

   "This overstates the integer performance for large programs, which
   are after all what people buy fast machines to run. MIPS implicitly
   acknowledges this by calling the M1000 a 10 MIP box despite the
   fact that all the published data in the MIPS performance brief
   would say integer performance is greater than 12 MIPs."

Unlike Sun, but like DEC, we consider both floating point and integer
performance when assigning a VUPS (sometimes called MIPS) rating to
our machines.  And yes, we don't use toys like dhrystone and stanford
for our ratings (we give results because they're popular).  Read
section 2.1 of the MIPS performance brief for details.  Is there
something wrong with basing ratings on large, real programs?
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086