Path: utzoo!attcan!uunet!cs.utexas.edu!csd4.milw.wisc.edu!bionet!ames!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: MIPS/MFLOPS ratio [long; here we go again; sorry]
Message-ID: <22792@winchester.mips.COM>
Date: 6 Jul 89 07:06:08 GMT
References: <596@megatek.UUCP> <112807@sun.Eng.Sun.COM>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Inc.
Lines: 405

1. INTRODUCTION

In article <112807@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:

1) Some comments about SPARC integer-vs-floating point that seem to
rewrite history from before Keith was at Sun, as well as some comments
about Hot Chips that need some balancing comments (which you can take either as
objective data, or as opposite-bias opinions; your call).

2) ``So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji.,BIT, LSI, TI,
>Solb., Prisma and all the others.''
Marketing B.S. doesn't make something ("tilt") true; only being true makes
it true; in any case, in my opinion, the logic (only if MIPS is smarter is there
no tilt towards SPARC) is flawed, and I'll show why.
-------
Some of this discussion inherently contains industry-oriented stuff,
which I'm forced into, as well as some serious technical meat, thank goodness.
If you don't like the former, hit "n" now.

OUTLINE OF REST:
2. KHB's MODEL OF SUN FP TRADEOFFS; ANOTHER MODEL
3. FP PRESENT, INCLUDING COMPILER ISSUES
4. ANALYSIS OF "TILTING TOWARDS SPARC", INCLUDING HOT CHIPS
4.1 WHAT KHB SAYS
4.2 WHAT MASH SAYS
4.3 HOT CHIPS, GENERAL
4.4 HOT CHIPS, CMOS FPU SESSION
4.5 "TILTING TOWARDS SPARC, UNLESS MIPS SMARTER THAN EVERYBODY" : UNPROVEN

2. KHB's MODEL OF SUN FP TRADEOFFS; ANOTHER MODEL
>In article <596@megatek.UUCP> mark@megatek.UUCP () writes:

>>This seems a little out of whack... it seems that older scientific
>>processors had ratios in the 3-4 range.

>Current SPARC implementations (chips and system) from Sun were
>intended for "more general purpose use" hence the (relatively) narrow
>gap between integer performance on a Cray to a 4/330. While floating
>point is fun (and is typically my reason for existing on a project) I
>spend most of my day doing compiles, editing, runing schedtool, and
>other nonFP things. So using the 80-20 rule... the first machines
>should be the ones we need 80% of the time.

FACT:  I admit to a nasty habit of keeping old marketing material
and press clippings, which I believe predate khb's tenure at Sun;
I often keep such things as a reality check.

The following are quotes from the July 87 Sun-4 introductory material:
``Relative to other manufacturer's high-end offerings,
the Sun-4/200 excels in floating-point performance.
In fact, the Sun-4/200 will execute floating-point-intensive applications
faster than the VAX 8800 superminicomputer.'' ....
``...giving users an overwhelming reason to migrate applications that
currently run on super computers, minsupers, and superminis onto workstations.''
``..first supercomputing workstation...''
``Sun-4/200 Series is ideally suited for all compute-intensive, floating-point,
or graphics-intensive applications.  The primary markets targeted are high-end
mechanical-CAD (MCAD) applications such as solids modeling and finite element
analysis, electrical-CAD (ECAD) applications including IC and PC layout
and routing; Artificial Intelligence (AI) development, earth resources,
molecular modelling, and other compute-intensive applications.''
``..ideal for applications in the scientific computing and electrical CAD
markets.''

OPINION: FP not important?? Less important for Sun-4s??

OPINION: I think the original assertion (==VAX8800 FP) is probably true, if you
replace Sun-4/200 (1987) by SPARCstation 3xx (1989). As pointed out shortly
thereafter, the VAX 8700 and 8800 are NOT the same: 8800 has 2 8700 CPUs.
It turned out that a Sun-4/200 was usually slower on many real
FP applications than an 8700 (especially if using VMS compilers, which is
what actually runs on most 8700/8800s).  [OPINION] SS3xxs do appear to be better
balanced than Sun-4/2xxs with regard to FP versus integer performance.

3. FP PRESENT, INCLUDING COMPILER ISSUES
(....why people think MIPS FP is faster than SPARC FP...)
>Compilers is often stated, but according to my weeks of staring at
>huge volumes of data, it seems that the compiler differences are
>minimal on large codes. The current sun compilers are somewhat less
>clever about certain operations, but not enough to explain the
>difference in performance.

I suspect much of the code looks similar, which is not surprising,
given the similarities of the register sets available at any one time,
FP instruction sets that are fairly similar, and IEEE.
At least one SPARC architectural difference was described by Tom Pennello of
MetaWare at Hot Chips, but khb failed to mention it:
passing FP arguments in the integer registers, and not having
direct moves to/from IU and FP, means that (in C, at least),
saying y = glurp(x), with floats x,y, gives you something like:

(x sitting in FP reg)
	store x to memory; load it to integer register z.
call glurp
	store z to memory; reload it into FP reg; compute
	store result into memory, reload it into integer result reg
return
	store result to memory; reload into FP reg (y)
I have no idea how often this happens; fortunately for SPARC, FORTRAN is
call-by-reference.  Note also that conversions from int<->float go
thru a similar drill (which is truly architectural, not an
architecture+language convention like the previous example, which, even if
not architectural, is probably so wired into things that it would be
nontrivial to change).
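
A rough way to tally the memory traffic in the y = glurp(x) sequence above:
the sketch below only restates the four store/reload pairs from that sequence
and counts them.  The step descriptions and names are illustrative
assumptions; nothing here is measured from a real SPARC compiler.

```python
# Hypothetical tally of the y = glurp(x) sequence described above.
# Each step is (description, stores, loads); the counts restate the
# prose sequence and are not measurements of any real compiler.
FP_VIA_INTEGER_REGS = [
    ("caller: x from FP reg to integer reg z", 1, 1),
    ("callee: z from integer reg back to FP reg", 1, 1),
    ("callee: result from FP reg to integer result reg", 1, 1),
    ("caller: result from integer reg to FP reg y", 1, 1),
]

def memory_ops(steps):
    """Return (stores, loads) summed over a list of steps."""
    stores = sum(s for _, s, _ in steps)
    loads = sum(l for _, _, l in steps)
    return stores, loads

stores, loads = memory_ops(FP_VIA_INTEGER_REGS)
print(f"stores={stores} loads={loads} total={stores + loads}")
```

Under these assumptions, one such call carries eight memory operations whose
only job is to move a value between the two register files; how often that
pattern shows up in real code is a separate question.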

The main reasons, I think, for the differences are:
	1) The SPARC multi-cycle loads and stores, which are not ISA,
		but SYSTEM architecture and implementation.
	2) The MIPS FPUs have lower cycle counts.
	3) The compiler thing is an open question; I haven't looked at
	much SPARC FP code lately, so I don't personally know.  Maybe
	some UNBIASED third-parties would care to comment and give some DATA.

4. ANALYSIS OF "TILTING TOWARDS SPARC", INCLUDING HOT CHIPS
4.1 WHAT KHB SAYS
>What is interesting is that the benchmarks which SPARC does worst on
>are highly FP and memory intensive (say 30-50% loads and stores).
(See the discussion on DP LINPACK later, which is actually one of the
SS3xx and Sun-4/2xx's best FP benchmarks; SPARC systems have good external
memory systems that are well-suited to memory-intensive applications.)

>MIPSco built their own FPU and tightly coupled it to their IU. This
>resulted in early units which were superior to the SPARC
>implementation philosophy (let's buy whatever is laying around and
>glue it in -- in the first implementations that meant a weitek 1164
>and 1165 and a controller ... "leftovers" from the sun3/fpa project).
>At yesterday's IEEE HOT CHIPS conference, we were treated to three
>papers about dedicated SPARC FPU's in addition to the papers focused
>on FPU's BIT is already sampling ECL SPARC chips. So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji.,BIT, LSI, TI,
>Solb., Prisma and all the others.

4.2 WHAT MASH SAYS
Sigh.  What does "tilting towards SPARC" mean?  Does it mean that
SPARC is getting ahead, or might be catching up ("tilting back towards
parity")?  I'm tired of this, but I can't let this
argument go past.... I believe SPARC is getting closer, but that doesn't
mean "tilting towards SPARC".

There is nothing wrong, a priori, with the SPARC implementation strategy
(of using some existing FPU parts, and getting to market quickly),
although calling the WTL parts "leftovers" might be a somewhat
Sun-centric view of the world, as those parts were used in plenty of
other machines, including early MIPS M/500s (before R2010s existed).
I'd use existing parts to get started, too; in fact, we did.
The original SPARC team was small, and didn't have infinite resources,
so this was all perfectly reasonable.  In retrospect, [OPINION], the
only problem was in not having somebody going like crazy to build a
serious CMOS SPARC FPU early enough, and I have no idea whether somebody
wanted to do this, and wasn't allowed to, or whether the partners didn't
want to, or whether nobody had time to think about it at the right time,
or what.  Maybe we could be enlightened.

In any case, the sequence is (with jiggles of a quarter possible on any date):
	MIPS				SPARC
4Q86	WTL 116x in M/500		WTL 116x in Sun-3
2Q87	R2010 in M/500 socket, M/800
3Q87					WTL 116x in Sun-4
4Q87	R2010 in M/1000
2Q88	R2010 in M/120
4Q88	R3010 in M/2000
1Q89
2Q89					TI8847 in Sun-4 and SS300
					WTL 3170 in SS1

4.3 HOT CHIPS, GENERAL
1) FACT: presentations at conferences are not deliveries of systems.

2) OPINION: The BIT+Sun ECL design looks well-done, with some reasonable
and informed thinking in many places.  Before SPARC victory is declared
by khb on the ECL front, we maybe ought to wait for the first actual ECL systems
to be shipped, and see how they run real programs.  Anant Agrawal's talk
was well-done, and mostly solid technical content (except for "World's
first single chip ECL 32 bit processor" and "World's fastest microprocessor.
80MHz 12.5ns cycle."  If you add "announced" to those, I might agree.:-)
Despite such claims, it didn't give any SPECIFIC performance data (simulations
of real programs).....  There was a good treatment of cache interface,
although a few interesting parts (like actual cache and MMU designs,
and getting enough fast enough SRAM hooked up) of building a
complete system are Left To The Reader.....  Khb might want to ask his
ECL colleagues about some of these issues.  Still, this was a credible
presentation and design, and for reasons that will be obvious sooner or
later, there are more reasons for FP performance to be more similar
than in past designs.

3) OPINION: Pete Wilson's Prisma talk was delightful and fascinating; I admit
that MIPS is not, to my knowledge, building a GaAs supercomputer of
the $500K-$1M ilk, so I wish them well.

4) FACT: Solbourne did not present at the conference.
Fujitsu referenced WTL 3170, but didn't otherwise talk about FP that I can
recall.  Cypress/Ross mentioned the CY7C602-FPU (which is, I think, the same as
the TI ....602).

5) That leaves LSI, TI; I guess Weitek is "all the rest", unless
I missed somebody, which is possible.

4.4 HOT CHIPS, CMOS FPU SESSION
khb: "treated to three papers"

FACT: we had a session with 3 CMOS SPARC FPUs (Weitek, TI, LSI),
followed by Earl Killian of MIPS. The session chair introduced Earl as someone
who would not talk about a SPARC FPU.  This comment elicited a
noticeable round of applause from the audience..... perhaps khb would
comment on that reaction to a "treat".

Now, the 3 CMOS SPARC FPU papers described reasonable devices,
that in some cases include fairly clever things.  On the other hand,
we were given almost zero serious performance analysis,
or motivational material to say why things were done differently;
the LSIL presentation did include a cycle count comparison, which unfortunately
was not included in the handouts, and I couldn't write it down fast enough,
or I'd repeat it here.  Presumably, if I were a SPARC customer, I might be able
to get enough information on realistic usages and environments to figure
out what programs would run faster with which chip combinations;
such insight was NOT obvious from the presentations.

Khb could do much to turn his comments into real DATA,
and maybe thus offer a thesis that could be analyzed, if he
would do the following:
	a) Gather all of the ACTUAL cycle counts of these various chips,
	and put them in a table like the LSIL speaker showed, and post it here.
	(This data is clearly publicly available, I think.)
	b) Give a clear description of the overlap characteristics of these
	chips.  I think most of them overlap {add/sub/conv, mul/div/sqrt, and
	load/store}, and I don't think any of them are pipelined, but I
	could be wrong.
	c) Give a terse, clear description of these chips in terms of which
	ones are used in which currently-public SPARC systems, and dispel any
	confusion about already-cited benchmark numbers.  [When I read the
	trade press, I get confused, because they talk about things like
	shipping some SS1s with TI parts, but enough WTL parts are now available
	to use them instead, and I have no idea if that's press error, or real,
	and if real, what difference it would make.]
	d) If there are REAL benchmarks, or even simulations of the performance
	of these things that exist somewhere public, point us at them.

MIPS:
	Earl Killian described the R3010 FPU, including a large set of measured
	MFLOPS numbers [Livermore harmonic, geometric, arithmetic];
	Gaussian Elimination [linpack, fortran, rolled, linpack hand-coded,
	1000x1000], Matrix Multiply [50x50 handcoded], Multiply/Add Peak.
	(i.e., all numbers from the Performance Brief).
	He explained, with examples, why we chose low-latency, multiple
	overlapped FP operational units  (the R3010 appears to have
	somewhat more concurrency than some of the SPARC FPUs), rather than
	pipelined ones.  He talked about simulation tradeoffs, like
	simulating Spice (and other large programs) with a tweakable
	simulator to examine the effects of different pipelining
	strategies and latency tradeoffs.  He gave the cycle counts
	for most of the operations.
He also observed, that although the 25MHz R3010 was shipped in production
systems 8 months ago (almost a year ago @ 20MHz), and it was just a shrink
of the R2010, which was shipped in production systems over
2 years ago, the CMOS SPARC FPUs still haven't caught up, even the
forthcoming ones.
[MASH: Or, at least, no compelling evidence
was presented that they're going to blow it away, as there was a lot of talk
of handcoded LINPACK inner loop peak performance, sometimes offered
in tables comparing them with measured LINPACKs on real machines....
In fact, I think that only a few of the cycle counts on these
parts are better than the corresponding R3010 ones.  All of them suffer the
(SPARC architectural) lack of a direct data path between CPU & FPU.
Again, if khb, or somebody would post the actual cycle counts, we can see
whether my belief has any validity.]
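
For readers unfamiliar with the three Livermore summary numbers mentioned
above (harmonic, geometric, arithmetic): they are different means taken over
the same per-kernel MFLOPS rates.  A minimal sketch follows; the rates are
made up purely for illustration and are NOT R3010 or SPARC measurements, and
the function name is an assumption of this example.

```python
import math

def livermore_means(mflops):
    """Return (harmonic, geometric, arithmetic) means of per-kernel MFLOPS."""
    n = len(mflops)
    harmonic = n / sum(1.0 / r for r in mflops)
    geometric = math.exp(sum(math.log(r) for r in mflops) / n)
    arithmetic = sum(mflops) / n
    return harmonic, geometric, arithmetic

# Made-up rates for illustration only; not measurements of any machine.
rates = [1.0, 2.0, 4.0]
h, g, a = livermore_means(rates)
print(f"harmonic={h:.3f} geometric={g:.3f} arithmetic={a:.3f}")
```

For positive rates the harmonic mean is never larger than the geometric,
which is never larger than the arithmetic, so the harmonic mean is the one
that punishes slow kernels most; quoting all three makes it harder to hide a
few bad loops behind a few very good ones.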

Now, somebody might claim [well, they do], that the forthcoming
FPUs are targeted to 33 to 50MHz, (in some cases, people only listed the
timings corresponding to these rates), and that they'll run faster than
any R3010 ever will, AND THAT THEY'LL DO IT WHILE IT STILL MATTERS.
Maybe they will, maybe they won't, but to add some
credibility, I'd ask for the following DATA:
	0) Talk about synchronizing the CPU and FPU at these speeds.
	Do you have PLL's, or some other technique, or magic?
	1) What are the access times of the SRAMs needed to
	run at 30ns, 25ns, and 20ns cycle times? (Some of these parts
	were claimed to scale to 50MHz, so the 20ns is relevant.)
	2) What are the sizes, part-numbers, costs, and availability
	of those parts, and how many do you need? 
	3) What are the rest of the pieces that you need to
	run at those speeds?  and when can you really get them?
The only thing close to answering these questions was the Cypress/Ross
chipset description, and I'm not really sure what's happening there,
simply because I have a hard time relating their chip dates to system dates.
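
The cycle times in the questions above map to clock rates by simple
arithmetic (cycle time in ns = 1000 / clock in MHz).  A small sketch, with
function names that are just this example's choice:

```python
def cycle_ns(mhz):
    """Cycle time in nanoseconds for a clock rate given in MHz."""
    return 1000.0 / mhz

def clock_mhz(ns):
    """Clock rate in MHz for a cycle time given in nanoseconds."""
    return 1000.0 / ns

# The three cycle times asked about in question 1).
for ns in (30, 25, 20):
    print(f"{ns} ns cycle -> {clock_mhz(ns):.1f} MHz")

# The ECL figure quoted earlier in this article.
print(f"80 MHz -> {cycle_ns(80):.1f} ns cycle")
```

So 30ns, 25ns, and 20ns correspond to roughly 33, 40, and 50MHz.  Whatever
SRAM is used has to deliver data within that cycle along with the rest of the
path to and from the chips, which is why the access-time question matters.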

Basically, to use the RISCar metaphor, these are simple questions
to see if a million-RPM engine can actually be put into a
{buildable, sellable, maintainable} car, or whether the engine slows down.

SPARC implementation combinations that I've heard of:
	1) Fujitsu FPC + WTL 1164/65 (Sun-4/110, 200) (1987, 1988)
	2) FPU2 (TI 8847+ FPC) for Sun-4/110,200 (1989)
	3) WTL 3170 for LSIL/Fujitsu in SS1 (1989)
	4) TI 8847+FPC in SS3xx (I think), with Cypress 601 IU (1989)
	5) WTL 3171 (coming, to go with Cypress 601s) (1989)
	6) TI TMS390C602 (coming) (which, I think, really combines an 8847+FPC),
	to go with Cypress 601s (1989)
	7) LSIL L64814 FPU, coming, which also goes with Cypress 601s, or the
	LSIL IU with that pinout rather than the LSIL SPARC IUs used in SS1s.
(If I've missed anybody, I didn't mean to, and I'm sorry if I'm confused
about any of these: please correct me if I'm wrong).
BTW: as a side note to Sun: if you change FPUs in a system model, where it makes
a performance difference, PLEASE consider giving it a succinct, different
model number, or some identification, so people can know what they're
measuring and label them correctly. 

The corresponding MIPS sequence is:
	1) R2010, with R2000  (R2xxxAs are R3xxxs in R2xxx packages) (1987, 88)
	2) R3010 (shrunken R2010) with R3000 (which was changed some) (88, 89)

Keith is right: we're horribly outnumbered....still, in the CMOS
world, nobody yet is shipping any SPARC systems that equal a 25MHz R3x pair at
FP benchmarks, and in fact, the 25MHz SS300 (based on minimal data) looks
not much different from a MIPS M/120, which has a 16.7MHz R2xxx pair.
	
4.5 "TILTING TOWARDS SPARC, UNLESS MIPS SMARTER THAN EVERYBODY" : UNPROVEN
Now, I finally get to the comment that set all of this off: ``So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji.,BIT, LSI, TI,
>Solb., Prisma and all the others.''

In order to make sense of this, and to carefully avoid being
misinterpreted, I'll recast this with some logic for clarity:
	A: "....is tilting towards SPARC."
	B: "MIPSco is smarter than ...."
Now, khb's thesis may be rendered symbolically as:
	not-A ==> B  (i.e., that's what "A, unless B" means).
	not-B [I think: after reading this several times, I think the
		reader is being invited to disbelieve B as impossible,
		or to expect MIPSco to disprove A by proving B (which
		is impossible; there are smart people at lots of companies).
		khb does not SAY this, and if he didn't mean this,
		then you can ignore a lot of this.  However, I have heard
		this syllogism before, so it's not new....]
	= not-(not-A) ==> A
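
The syllogism can be checked mechanically with a truth table.  The sketch
below (helper names are this example's own) confirms that the FORM is valid:
whenever both premises hold, A holds.  It also lists the one assignment that
the first premise rules out, which is where any disagreement has to live.

```python
from itertools import product

def implies(p, q):
    """Material implication: p ==> q."""
    return (not p) or q

# A: "....is tilting towards SPARC."   B: "MIPSco is smarter than ...."
rows = list(product([False, True], repeat=2))

# Rows where both premises hold: (not-A ==> B) and not-B.
premises_hold = [(a, b) for a, b in rows if implies(not a, b) and not b]

# The inference is valid if A is true in every such row.
valid = all(a for a, _ in premises_hold)
print("rows satisfying both premises (A, B):", premises_hold)
print("inference to A is valid:", valid)

# Rows excluded by the first premise alone: not-A holds while B is false.
premise_fails = [(a, b) for a, b in rows if not implies(not a, b)]
print("rows where (not-A ==> B) is false:", premise_fails)
```

The only row the first premise excludes is A false and B false, i.e. "no
tilt towards SPARC, and MIPSco is not smarter than everybody".  Accepting
that this row is possible is exactly a rejection of the premise
(not-A ==> B), not of the logic.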

I claim that:
	1) There is, as yet, little DELIVERABLE evidence for A,
	with the exception that SPARCland is ahead of MIPSland in GaAs
	supercomputers.  The ECL verdict isn't in yet; so the rest of
	this discussion covers CMOS, only.
		[I've covered this somewhat above].
	2) Not (not-A ==> B), i.e., there could be plenty of reasons
	why A might not be true, without requiring B to be true.
	3) C, where C: "MIPSco may be able to hold its own in these wars,
		based on past history, and on the requirements for doing so."

Note that my claims are NOT, and should not be misconstrued as:
	1) B (MIPSco is smarter)
	2) E: where E is "MIPS will always be ahead, at every instant."

Now, perhaps khb did not observe a difference in style or strategy
amongst the {SPARC FPUs} vs {MIPS FPU} talks.  I did observe some,
and I add some other data, in defense of assertion C:

[OPINION] Here's some of what it takes to build hot CMOS chips (& the software
they need), in a timely and competitive fashion, and especially for the next
round (the integrated superchips):

a) Good simulation/analysis methodology for looking at design alternatives.
b) Close coupling of chip designers with systems designers, and smart sw folks:
	compiler folks: to answer questions  like "if we make multiply
		X cycles, how much overlap can you get back with a smarter
		pipeline organizer?" 
	OS & graphics folks, to answer all sorts of questions about
		memory hierarchy and other tradeoffs
c) Smart chip designers; we like having logic and circuit folks sitting next
	 to each other; others split it other ways.
d) People who know CMOS technology, yield, reliability, testability, etc.
e) CAD tools; diagnostics; design verification suites, etc, etc.
f) A whole lot of computing power to support all of this.
	(like, the DV folks will use an infinite amount if you let them :-)
g) Good chip technology and production.

Now, only a few of these are "smart people"..... which is what makes
the original khb thesis silly.  To do well, you need to combine at least
most of the above (not necessarily, or even usually, in one company,
but at least in a team).  
		
OK, almost done.
1) I'm NOT claiming MIPSco is smarter than everybody else;
I'm just arguing against the claim that the balance is on SPARC's side
UNLESS MIPSco is smarter than everybody else.

2) There are plenty of reasons why competitive balance swings
back and forth, and only some are smartness.

3) It really is boring having to respond to marketing FUD and
rewritings of history in comp.arch.  There are better things to do, and I'd much
rather see discussion of things like (to pick a simple case):
	Which is better: 2-cycle + & 5-cycle *, or 3-cycle + & 4-cycle *?
	On which kinds of benchmarks? why?
	How much difference does it make in performance? in silicon space?

I.e., things that give DATA, and even better INSIGHT........
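
As a starting point for the add/multiply latency question above, here is a
deliberately crude sketch: it treats each kernel as one fully dependent chain
of operations, so total time is just the sum of latencies.  The kernels and
names are hypothetical, and the model ignores overlap between functional
units, loads/stores, and compiler scheduling, all of which a real comparison
would have to simulate.

```python
DESIGNS = {
    "2-cycle + / 5-cycle *": {"+": 2, "*": 5},
    "3-cycle + / 4-cycle *": {"+": 3, "*": 4},
}

# Hypothetical fully dependent operation chains (no overlap possible).
KERNELS = {
    "running sum (8 adds)": "+" * 8,
    "multiply-add chain (4 pairs)": "*+" * 4,
    "running product (8 multiplies)": "*" * 8,
}

def chain_cycles(ops, latency):
    """Cycles for a serial chain where each op waits for the previous one."""
    return sum(latency[op] for op in ops)

for kname, ops in KERNELS.items():
    for dname, lat in DESIGNS.items():
        print(f"{kname:32s} {dname}: {chain_cycles(ops, lat)} cycles")
```

Even this toy shows why "on which kinds of benchmarks?" is the right
question: the add-heavy chain favors the 2-cycle add, the multiply-heavy
chain favors the 4-cycle multiply, and a strictly alternating multiply-add
chain comes out the same (7 cycles per pair) either way.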

4) It would be nice to get some clear DATA posted about the forthcoming
SPARC FPUs.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086