Path: utzoo!attcan!uunet!cs.utexas.edu!csd4.milw.wisc.edu!bionet!ames!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: MIPS/MFLOPS ratio [long; here we go again; sorry]
Message-ID: <22792@winchester.mips.COM>
Date: 6 Jul 89 07:06:08 GMT
References: <596@megatek.UUCP> <112807@sun.Eng.Sun.COM>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Inc.
Lines: 405

1. INTRODUCTION

In article <112807@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:

1) Some comments about SPARC integer-vs-floating point that seem to rewrite history from before when Keith was at Sun, as well as some comments about Hot Chips that need some balancing comments (which you can take either as objective data, or as opposite-bias opinions; your call).

2) ``So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji.,BIT, LSI, TI,
>Solb., Prisma and all the others.''

Marketing B.S. doesn't make something ("tilt") true; only being true makes it true; in any case, in my opinion, the logic (only if MIPS is smarter is there no tilt towards SPARC) is flawed, and I'll show why.

-------
Some of this discussion inherently contains industry-oriented stuff, which I'm forced into, as well as some serious technical meat, thank goodness. If you don't like the former, hit "n" now.

OUTLINE OF REST:
2. KHB's MODEL OF SUN FP TRADEOFFS; ANOTHER MODEL
3. FP PRESENT, INCLUDING COMPILER ISSUES
4. ANALYSIS OF "TILTING TOWARDS SPARC", INCLUDING HOT CHIPS
   4.1 WHAT KHB SAYS
   4.2 WHAT MASH SAYS
   4.3 HOT CHIPS, GENERAL
   4.4 HOT CHIPS, CMOS FPU SESSION
   4.5 "TILTING TOWARDS SPARC, UNLESS MIPS SMARTER THAN EVERYBODY" : UNPROVEN

2. KHB's MODEL OF SUN FP TRADEOFFS; ANOTHER MODEL

>In article <596@megatek.UUCP> mark@megatek.UUCP () writes:
>>This seems a little out of whack... it seems that older scientific
>>processors had ratios in the 3-4 range.
>Current SPARC implementations (chips and system) from Sun were
>intended for "more general purpose use" hence the (relatively) narrow
>gap between integer performance on a Cray to a 4/330. While floating
>point is fun (and is typically my reason for existing on a project) I
>spend most of my day doing compiles, editing, runing schedtool, and
>other nonFP things. So using the 80-20 rule... the first machines
>should be the ones we need 80% of the time.

FACT: I admit to a nasty habit of keeping old marketing material and press clippings, which I believe predate khb's tenure at Sun; I often keep such things as a reality check. The following are quotes from the July 87 Sun-4 introductory material:

``Relative to other manufacturer's high-end offerings, the Sun-4/200 excels in floating-point performance. In fact, the Sun-4/200 will execute floating-point-intensive applications faster than the VAX 8800 superminicomputer.''
....
``...giving users an overwhelming reason to migrate applications that currently run on super computers, minisupers, and superminis onto workstations.''

``..first supercomputing workstation...''

``Sun-4/200 Series is ideally suited for all compute-intensive, floating-point, or graphics-intensive applications. The primary markets targeted are high-end mechanical-CAD (MCAD) applications such as solids modeling and finite element analysis, electrical-CAD (ECAD) applications including IC and PC layout and routing; Artificial Intelligence (AI) development, earth resources, molecular modelling, and other compute-intensive applications.''

``..ideal for applications in the scientific computing and electrical CAD markets.''

OPINION: FP not important?? Less important for Sun-4s??

OPINION: I think the original assertion (==VAX8800 FP) is probably true, if you replace Sun-4/200 (1987) by SPARCstation 3xx (1989). As pointed out shortly thereafter, the VAX 8700 and 8800 are NOT the same: the 8800 has 2 8700 CPUs.
It turned out that a Sun-4/200 was usually slower on many real FP applications than an 8700 (especially if using VMS compilers, which is what actually runs on most 8700/8800s). [OPINION] SS3xxs do appear to be better balanced than Sun-4/2xxs with regard to FP versus integer performance.

3. FP PRESENT, INCLUDING COMPILER ISSUES

(....why people think MIPS FP is faster than SPARC FP...)

>Compilers is often stated, but according to my weeks of staring at
>huge volumes of data, it seems that the compiler differences are
>minimal on large codes. The current sun compilers are somewhat less
>clever about certain operations, but not enough to explain the
>difference in performance.

I suspect much of the code looks similar, which is not surprising, given the similarities of the register sets available at any one time, FP instruction sets that are fairly similar, and IEEE. At least one SPARC architectural difference was described by Tom Pennelo of Metaware at Hot Chips, but khb failed to mention it: passing FP arguments in the integer registers, and not having direct moves to/from IU and FP, means that (in C, at least), saying y = glurp(x), with floats x,y, gives you something like:

    (x sitting in FP reg)
    store x to memory; load it to integer register z.
    call glurp
        store z to memory; reload it into FP reg; compute
        store result into memory, reload it into integer result reg
        return
    store result to memory; reload into FP reg (y)

I have no idea how often this happens; fortunately for SPARC, FORTRAN is call-by-reference. Note also that conversions from int<->float go thru a similar drill (which is truly architectural, not architecture + language convention like the previous example, which, even if not architectural, is probably so wired into things that it would be nontrivial to change).

The main reasons, I think, for the differences are:

1) The SPARC multi-cycle loads and stores, which are not ISA, but SYSTEM architecture and implementation.
2) The MIPS FPUs have lower cycle counts.
3) The compiler thing is an open question; I haven't looked at much SPARC FP code lately, so I don't personally know. Maybe some UNBIASED third-parties would care to comment and give some DATA.

4. ANALYSIS OF "TILTING TOWARDS SPARC", INCLUDING HOT CHIPS

4.1 WHAT KHB SAYS

>What is interesting is that the benchmarks which SPARC does worst on
>are highly FP and memory intensive (say 30-50% loads and stores).

(See the discussion on DP LINPACK later, which is actually one of the SS3xx and Sun-4/2xx's best FP benchmarks; SPARC systems have good external memory systems that are well-suited to memory-intensive applications.)

>MIPSco built their own FPU and tightly coupled it to their IU. This
>resulted in early units which were superior to the SPARC
>implementation philosophy (let's buy whatever is laying around and
>glue it in -- in the first implementations that meant a weitek 1164
>and 1165 and a controller ... "leftovers" from the sun3/fpa project).
>At yesterday's IEEE HOT CHIPS conference, we were treated to three
>papers about dedicated SPARC FPU's in addition to the papers focused
>on FPU's BIT is already sampling ECL SPARC chips. So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji.,BIT, LSI, TI,
>Solb., Prisma and all the others.

4.2 WHAT MASH SAYS

Sigh. What does "tilting towards SPARC" mean? Does it mean that SPARC is getting ahead, or might be catching up ("tilting back towards parity")? I'm tired of this, but I can't let this argument go past.... I believe SPARC is getting closer, but that doesn't mean "tilting towards SPARC".

There is nothing wrong, a priori, with the SPARC implementation strategy (of using some existing FPU parts, and getting to market quickly), although calling the WTL parts "leftovers" might be a little Sun-centric view of the world, as those parts were used in plenty of other machines, including early MIPS M/500s (before R2010s existed).
I'd use existing parts to get started, too; in fact, we did. The original SPARC team was small, and didn't have infinite resources, so this was all perfectly reasonable. In retrospect, [OPINION], the only problem was in not having somebody going like crazy to build a serious CMOS SPARC FPU early enough, and I have no idea whether somebody wanted to do this, and wasn't allowed to, or whether the partners didn't want to, or whether nobody had time to think about it at the right time, or what. Maybe we could be enlightened.

In any case, the sequence is (with jiggles of a quarter possible on any date):

        MIPS                            SPARC
4Q86    WTL 116x in M/500               WTL 116x in Sun-3
2Q87    R2010 in M/500 socket, M/800
3Q87                                    WTL 116x in Sun-4
4Q87    R2010 in M/1000
2Q88    R2010 in M/120
4Q88    R3010 in M/2000
1Q89
2Q89                                    TI8847 in Sun-4 and SS300
                                        WTL 3170 in SS1

4.3 HOT CHIPS, GENERAL

1) FACT: presentations at conferences are not deliveries of systems.

2) OPINION: The BIT+Sun ECL design looks well-done, with some reasonable and informed thinking in many places. Before SPARC victory is declared by khb on the ECL front, we maybe ought to wait for the first actual ECL systems to be shipped, and see how they run real programs. Anant Agrawal's talk was well-done, and mostly solid technical content (except for "World's first single chip ECL 32 bit processor" and "World's fastest microprocessor. 80MHz 12.5ns cycle."; if you add "announced" to those, I might agree :-). Despite such claims, it didn't give any SPECIFIC performance data (simulations of real programs)..... There was a good treatment of cache interface, although a few interesting parts of building a complete system (like actual cache and MMU designs, and getting enough fast enough SRAM hooked up) are Left To The Reader..... Khb might want to ask his ECL colleagues about some of these issues.
Still, this was a credible presentation and design, and for reasons that will be obvious sooner or later, there are more reasons for FP performance to be more similar than in past designs.

3) OPINION: Pete Wilson's Prisma talk was delightful and fascinating; I admit that MIPS is not, to my knowledge, building a GaAs supercomputer of the $500K-$1M ilk, so I wish them well.

4) FACT: Solbourne did not present at the conference. Fujitsu referenced WTL 3170, but didn't otherwise talk about FP that I can recall. Cypress/Ross mentioned the CY7C602-FPU (which is, I think, the same as the TI ....602).

5) That leaves LSI, TI; I guess Weitek is "all the rest", unless I missed somebody, which is possible.

4.4 HOT CHIPS, CMOS FPU SESSION

khb: "treated to three papers"

FACT: we had a session with 3 CMOS SPARC FPUs (Weitek, TI, LSI), followed by Earl Killian of MIPS. The session chair introduced Earl as someone who would not talk about a SPARC FPU. This comment elicited a noticeable round of applause from the audience..... perhaps khb would comment on that reaction to a "treat".

Now, the 3 CMOS SPARC FPU papers described reasonable devices, that in some cases include fairly clever things. On the other hand, we were given almost zero serious performance analysis, or motivational material to say why things were done differently; the LSIL presentation did include a cycle count comparison, which unfortunately was not included in the handouts, and I couldn't write it down fast enough, or I'd repeat it here. Presumably, if I were a SPARC customer, I might be able to get enough information on realistic usages and environments to figure out what programs would run faster with which chip combinations; such insight was NOT obvious from the presentations.
Khb could do much to turn his comments into real DATA, and maybe thus offer a thesis that could be analyzed, if he would do the following:

a) Gather all of the ACTUAL cycle counts of these various chips, and put them in a table like the LSIL speaker showed, and post it here. (This data is clearly publicly available, I think.)

b) Give a clear description of the overlap characteristics of these chips. I think most of them overlap {add/sub/conv, mul/div/sqrt, and load/store}, and I don't think any of them are pipelined, but I could be wrong.

c) Give a terse, clear description of these chips in terms of which ones are used in which currently-public SPARC systems, and dispel any confusion about already-cited benchmark numbers. [When I read the trade press, I get confused, because they talk about things like shipping some SS1s with TI parts, but enough WTL parts are now available to use them instead, and I have no idea if that's press error, or real, and if real, what difference it would make.]

d) If there are REAL benchmarks, or even simulations of the performance of these things that exist somewhere public, point us at them.

MIPS: Earl Killian described the R3010 FPU, including a large set of measured MFLOPS numbers: Livermore [harmonic, geometric, arithmetic]; Gaussian Elimination [linpack, fortran, rolled, linpack hand-coded, 1000x1000]; Matrix Multiply [50x50 handcoded]; Multiply/Add Peak (i.e., all numbers from the Performance Brief). He explained, with examples, why we chose to use low-latency, multiple overlapped FP operational units (the R3010 appears to have somewhat more concurrency than some of the SPARC FPUs), rather than pipelined ones. He talked about simulation tradeoffs, like simulating Spice (and other large programs) with a tweakable simulator to examine the effects of different pipelining strategies and latency tradeoffs. He gave the cycle counts for most of the operations.
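
[Aside, for readers unfamiliar with the three Livermore summary figures named above: they are three different means of the same set of per-kernel MFLOPS rates. The following is only a minimal Python sketch of that arithmetic. The rates in it are invented for illustration and are NOT measurements of any MIPS or SPARC machine, and the function names are mine.]

```python
import math


def arithmetic_mean(rates):
    """Plain average of the per-kernel rates; dominated by the fastest kernels."""
    return sum(rates) / len(rates)


def geometric_mean(rates):
    """n-th root of the product of the rates, computed via logarithms."""
    return math.exp(sum(math.log(r) for r in rates) / len(rates))


def harmonic_mean(rates):
    """Reciprocal of the average reciprocal; dominated by the slowest kernels."""
    return len(rates) / sum(1.0 / r for r in rates)


# Hypothetical per-kernel MFLOPS rates, invented for illustration only.
hypothetical_mflops = [1.0, 2.0, 4.0, 8.0]

for name, mean in (("harmonic", harmonic_mean),
                   ("geometric", geometric_mean),
                   ("arithmetic", arithmetic_mean)):
    print(f"{name:10s} mean: {mean(hypothetical_mflops):.3f} MFLOPS")
```

For any set of positive rates, harmonic <= geometric <= arithmetic, so a single "Livermore MFLOPS" number means little unless you know which of the three it is.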

He also observed that, although the 25MHz R3010 was shipped in production systems 8 months ago (almost a year ago @ 20MHz), and it was just a shrink of the R2010, which was shipped in production systems over 2 years ago, the CMOS SPARC FPUs still haven't caught up, even the forthcoming ones. [MASH: Or, at least, no compelling evidence was presented that they're going to blow it away, as there was a lot of talk of handcoded LINPACK inner loop peak performance, sometimes offered in tables comparing them with measured LINPACKs on real machines.... In fact, I think that only a few of the cycle counts on these parts are better than the corresponding R3010 ones. All of them suffer the (SPARC architectural) lack of direct data path between CPU & FPU. Again, if khb, or somebody, would post the actual cycle counts, we can see whether my belief has any validity.]

Now, somebody might claim [well, they do], that the forthcoming FPUs are targeted to 33 to 50MHz (in some cases, people only listed the timings corresponding to these rates), and that they'll run faster than any R3010 ever will, AND THAT THEY'LL DO IT WHILE IT STILL MATTERS. Maybe they will, maybe they won't, but to add some credibility, I'd ask for the following DATA:

0) Talk about synchronizing the CPU and FPU at these speeds. Do you have PLL's, or some other technique, or magic?

1) What are the access times of the SRAMs needed to run at 30ns, 25ns, and 20ns cycle times? (Some of these parts were claimed to scale to 50MHz, so the 20ns is relevant.)

2) What are the sizes, part-numbers, costs, and availability of those parts, and how many do you need?

3) What are the rest of the pieces that you need to run at those speeds? And when can you really get them? The only thing close to answering this question was the Cypress/Ross chipset description, and I'm not really sure what's happening there, simply because I have a hard time relating their chip dates to system dates.
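
[For reference: the cycle times and clock rates quoted in this post are two views of the same number, related by period (ns) = 1000 / frequency (MHz). A small Python sketch of just that arithmetic follows; the function names are mine and purely illustrative. It says nothing about what SRAM access time a given cycle time actually requires, which is exactly the DATA being asked for.]

```python
def period_ns_to_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def mhz_to_period_ns(mhz):
    """Convert a clock frequency in MHz to a period in nanoseconds."""
    return 1000.0 / mhz


# Cycle times mentioned in the post: 30, 25 and 20 ns (CMOS parts)
# and 12.5 ns (the announced ECL part).
for ns in (30.0, 25.0, 20.0, 12.5):
    print(f"{ns:5.1f} ns cycle -> {period_ns_to_mhz(ns):5.1f} MHz")

# Clock rates mentioned in the post for shipping MIPS parts.
for mhz in (16.7, 20.0, 25.0):
    print(f"{mhz:5.1f} MHz -> {mhz_to_period_ns(mhz):5.1f} ns cycle")
```

So "33 to 50MHz" and "30ns to 20ns" describe the same range, and 80MHz is the 12.5ns figure quoted for the ECL part.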

Basically, to use the RISCar metaphor, these are simple questions to see if a million-RPM engine can actually be put into a {buildable, sellable, maintainable} car, or whether the engine slows down.

SPARC implementation combinations that I've heard of:

1) Fujitsu FPC + WTL 1164/65 (Sun-4/110, 200) (1987, 1988)
2) FPU2 (TI 8847+ FPC) for Sun-4/110,200 (1989)
3) WTL 3170 for LSIL/Fujitsu in SS1 (1989)
4) TI 8847+FPC in SS3xx (I think), with Cypress 601 IU (1989)
5) WTL 3171 (coming, to go with Cypress 601s) (1989)
6) TI TMS390C602 (coming) (which, I think really combines an 8847+FPC), to go with Cypress 601s (1989)
7) LSIL L64814 FPU, coming, which also goes with Cypress 601s, or the LSIL IU with that pinout rather than the LSIL SPARC IUs used in SS1s.

(If I've missed anybody, I didn't mean to, and I'm sorry if I'm confused about any of these: please correct me if I'm wrong).

BTW: as a side note to Sun: if you change FPUs in a system model, where it makes a performance difference, PLEASE consider giving it a succinct, different model number, or some identification, so people can know what they're measuring and label them correctly.

The corresponding MIPS sequence is:

1) R2010, with R2000 (R2xxxAs are R3xxxs in R2xxx packages) (1987, 88)
2) R3010 (shrunken R2010) with R3000 (which was changed some) (88, 89)

Keith is right: we're horribly outnumbered....still, in the CMOS world, nobody yet is shipping any SPARC systems that equal a 25MHz R3x pair at FP benchmarks, and in fact, the 25MHz SS300 (based on minimal data) looks not much different from a MIPS M/120, which has a 16.7MHz R2xxx pair.

4.5 "TILTING TOWARDS SPARC, UNLESS MIPS SMARTER THAN EVERYBODY" : UNPROVEN

Now, I finally get to the comment that set all of this off:

``So the FPU
>integration/implementation variable is tilting towards SPARC (unless
>one assumes that MIPSco is smarter than Ross, Fuji.,BIT, LSI, TI,
>Solb., Prisma and all the others.''

In order to bring sense from this, and to carefully avoid being misinterpreted, I'll recast this with some logic for clarity:

    A: "....is tilting towards SPARC."
    B: "MIPSco is smarter than ...."

Now, khb's thesis may be rendered symbolically as:

    not-A ==> B    (i.e., that's what "A, unless B" means).
    not-B
    = not-(not-A) ==> A

[On the not-B step: I think, after reading this several times, that the reader is being invited to disbelieve B as impossible, or to expect MIPSco to disprove A by proving B (which is impossible; there are smart people at lots of companies). khb does not SAY this, and if he didn't mean this, then you can ignore a lot of this. However, I have heard this syllogism before, so it's not new....]

I claim that:

1) There is, as yet, little DELIVERABLE evidence for A, with the exception that SPARCland is ahead of MIPSland in GaAs supercomputers. The ECL verdict isn't in yet; so the rest of this discussion covers CMOS, only. [I've covered this somewhat above].

2) Not (not-A ==> B), i.e., there could be plenty of reasons why A might not be true, without requiring B to be true.

4) C, where C: "MIPSco may be able to hold its own in these wars, based on past history, and on the requirements for doing so."

Note that my claims are NOT, and should not be misconstrued as:

1) B (MIPSco is smarter)
2) E: where E is "MIPS will always be ahead, at every instant."

Now, perhaps khb did not observe a difference in style or strategy amongst the {SPARC FPUs} vs {MIPS FPU} talks.
I did observe some, and I add some other data, in defense of assertion C:

[OPINION] Here's some of what it takes to build hot CMOS chips (& the software they need), in a timely and competitive fashion, and especially for the next round (the integrated superchips):

a) Good simulation/analysis methodology for looking at design alternatives.

b) Close coupling of chip designers with systems designers, and smart sw folks:
    compiler folks: to answer questions like "if we make multiply X cycles, how much overlap can you get back with a smarter pipeline organizer?"
    OS & graphics folks: to answer all sorts of questions about memory hierarchy and other tradeoffs

c) Smart chip designers; we like having logic and circuit folks sitting next to each other; others split it other ways.

d) People who know CMOS technology, yield, reliability, testability, etc.

e) CAD tools; diagnostics; design verification suites, etc, etc.

f) A whole lot of computing power to support all of this. (like, the DV folks will use an infinite amount if you let them :-)

g) Good chip technology and production.

Now, only a few of these are "smart people"..... which is what makes the original khb thesis silly. To do well, you need to combine at least most of the above (not necessarily, or even usually, in one company, but at least in a team).

OK, almost done.

1) I'm NOT claiming MIPSco is smarter than everybody else; I'm just arguing against the claim that the balance is on SPARC's side UNLESS MIPSco is smarter than everybody else.

2) There are plenty of reasons why competitive balance swings back and forth, and only some are smartness.

3) It really is boring having to respond to marketing FUD and rewritings of history in comp.arch. There are better things to do, and I'd much rather see discussion of things like (to pick a simple case): Which is better: 2-cycle + & 5-cycle *, or 3-cycle + & 4-cycle *? On which kinds of benchmarks? Why? How much difference does it make in performance? In silicon space?
I.e., things that give DATA, and even better INSIGHT........

4) It would be nice to get some clear DATA posted about the forthcoming SPARC FPUs.

--
-john mashey    DISCLAIMER:
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086