Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!think!nike!oliveb!glacier!mips!mash
From: mash@mips.UUCP (John Mashey)
Newsgroups: net.arch
Subject: Re: Floating point performance
Message-ID: <725@mips.UUCP>
Date: Thu, 16-Oct-86 00:03:13 EDT
Article-I.D.: mips.725
Posted: Thu Oct 16 00:03:13 1986
Date-Received: Thu, 16-Oct-86 22:55:43 EDT
References: <340@euroies.UUCP> <1989@videovax.UUCP> <722@mips.UUCP> <8184@sun.uucp>
Reply-To: mash@mips.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 133

In article <8184@sun.uucp> dgh@sun.uucp (David Hough) writes:
>
>                       Mflops Per MHz
>	I'd like to add to John Mashey's recent posting about
>floating-point performance....
Thanks; as I'd said, the parentage of the numbers was suspect, so it's
good to see some I trust more.
>
>Machine          CPU MHz  FPU MHz  KW   KW/CPUMHz  KW/FPUMHz
>
>Sun-3/160+68881  16.7     16.7     955  60         60
Oops, I'd thought you guys used 12.5 MHz 68881s at one point [but I
checked the current literature and it says no. Has it changed recently?]
> .... Whetstone
>benchmark performance is mostly determined by the following code:
> (bunch of code) ...
>
>On Weitek 1164/1165-based systems, execution time for the P3 loop is
>dominated by the division operation...
>something far more interesting: the best way to improve Whetstone
>performance is to do enough global inter-procedural optimization in your
>compiler to determine that P3 only needs to be called once.  This
>gives a 2X performance increase with no hardware work at all!  One MIPS
>paper suggests that the MIPS compiler does this or something similar.
Actually, that's an optional optimizing phase whose heuristics are still
being tuned: we didn't use it on this, and in fact, don't generally use
it on synthetic benchmarks at all: it's too destructive! (There's nothing
like seeing functions being grabbed in-line, discovering that they don't
do anything, and then just optimizing the whole thing away.
At least Whetstone computes and prints some numbers, so some real work
got done. Nevertheless, David's comments are appropriate, i.e., we share
the same skepticism of Whetstone, as I'd noted in the original posting).
>Maybe benchmark performance should be normalized for software as well
>as hardware technology.
True! Some interesting work along that line was done over at Stanford by
Fred Chow, who did a machine-independent optimizer with multiple
back-ends, to be able to compare machines using the same compiler
technology. That's probably the best way to factor it out. The other
interesting way is to be able to turn optimizations on/off and see how
much difference they make.
>
>	I've discussed benchmarking issues at length in the
>Floating-Point Programmer's Guide for the Sun Workstation, 3.2 Release,
>leading
Is this out yet? Sounds good. Previous memos have been useful.
>to the recommendation that the nonlinear optimization and zero-finding
>that P3 is intended to mimic is better benchmarked by the real thing,
>such as the SPICE program.
Yes, although it would be awfully nice to have smaller hunks of it that
could be turned into reasonable-size benchmarks, especially ones that
could be simulated (in advance of CPU design) a little more easily.
>
>	Linear problems are usually characterized by large dimension and
>therefore memory and bus performance is as important as peak
>floating-point performance; a Linpack benchmark with
>suitably-dimensioned arrays is appropriate.
Yes.
>
>	I don't know whether RISC or CISC designs will prove to give the
>most bang for the buck, but I do have some philosophical questions for
>RISC gurus:  Is hardware floating point faster than software floating
>point on RISC systems?  If so, and it is because the FPU technology is
>faster than the CPU, then why isn't the CPU fabricated with that
>technology?
>If it's just a matter of obtaining parallelism, then wouldn't
>two identical CPU's work just as well and be more flexible for
>non-floating-point applications?  If there are functional units on the
>FPU that aren't on the CPU, should they be on the CPU so
>non-floating-point instructions can use them if desirable?  If the CPU
>and FPU are one chip, cycle times should be slower, but would the
>reduced communication overhead compensate?  If you use separate
>heterogeneous processors, don't you end up with ... a CISC?

1) Is hardware FP faster? Yes.
2) No, technology is the same, at least in our case. I don't know what
   other people do.
3) It's not just parallelism, but dedicating the right kind of hardware.
   A 32-bit integer CPU has no particular reason to have the kinds of
   datapaths an FPU needs. There are functional units on the FPU, but
   they aren't ones that help the CPU much (or they would have been on
   the CPU in the first place!)
4) Would reduced communication overhead compensate? Probably not, at the
   current state of technology that is generally available. Right now,
   at least in anything close to 2-micron CMOS, if the FPU is part of
   the CPU chip, it just has to be heavily microcoded. It's only when
   chips shrink enough that you can put the fastest FPU together with
   the CPU on 1 chip, and have nothing better to put on that chip, that
   it's worth doing for performance. (Note: there may be other reasons,
   or different price/performance aim points, for integrating them, but
   if you want FP performance, you must dedicate significant silicon
   real-estate.)
5) Don't you end up with ... a CISC? I'm not sure what this means.
   RISC means different things to different people. What it usually
   means to us is:
	a) A design approach where hardware resources are concentrated
	on things that are performance-critical and universal.
	b) The belief that in making things fast, instructions and/or
	complex addressing formats drop out, NOT as a GOAL, but as a
	side-effect.
Thus, in our case, we designed a CPU that would go fast for integer
performance, and have a tightly coupled coprocessor interface that would
let FP go fast also. (Note: integer performance is universal, whereas FP
is mostly bimodal: people either don't care about it at all, or want as
much as they can get.) When you measure integer programs, you make
choices to include or delete features, according to the statistics seen
in measuring substantial programs. You do the same thing for
FP-intensive programs. Guess what! You discover that FP Adds, Subtracts,
Multiplies (and maybe Divides) are:
	a) Good Things
	b) Not simulatable by integer arithmetic very quickly.
However, suppose that we'd discovered that FP Divide happened so seldom
that it could be simulated in software at an adequate performance level,
and that taking that silicon and using it to make FP Mult faster gave
better overall performance. In that case, we might have done it that way.

In any case, we don't see any conflict in having a RISC with FP (or
decimal, or ... anything where some important class of application needs
hardware thrown at it and can justify the cost of having it). Seymour
Cray has been doing fast machines for years with similar design
principles (if at a different cost point!) and FP has certainly been
there.

Anyway, thanks for the additional data. Also, I'd be happy to see more
discussion on what metrics are reasonable [especially since the original
posting invented "Whetstones/MHz" on the spur of the moment, and there
have been some interesting side discussions generated, both on:
	a) Are KWhets a good choice?
	b) What's a MHz?]
As can be seen, this business is still clearly in need of benchmarks
that:
	a) measure something real.
	b) measure something understandable.
	c) are small enough that they can be run and simulated in
	reasonable time.
	d) predict real performance of adequate-sized classes of programs.
	e) are used by enough people that you can do comparisons.
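
A minimal sketch of the "Whetstones/MHz" normalization discussed above,
assuming it is nothing more than KWhetstones divided by a clock rate.
The helper name kwhets_per_mhz and the dict layout are hypothetical,
invented for illustration; the only data used is the one Sun-3/160 row
quoted in this article. Which clock to divide by (CPU or FPU) is exactly
the "What's a MHz?" question, so the sketch computes both.

```python
# Illustrative only: the function name and data layout are assumptions,
# not anything taken from the original postings.
def kwhets_per_mhz(kwhets, clock_mhz):
    """Return KWhetstones per MHz of whichever clock the caller picks."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return kwhets / clock_mhz


# The single row quoted in this article (Sun-3/160 + 68881).
sun3 = {"kwhets": 955, "cpu_mhz": 16.7, "fpu_mhz": 16.7}

per_cpu_mhz = kwhets_per_mhz(sun3["kwhets"], sun3["cpu_mhz"])
per_fpu_mhz = kwhets_per_mhz(sun3["kwhets"], sun3["fpu_mhz"])

print(f"KW/CPU MHz = {per_cpu_mhz:.1f}")
print(f"KW/FPU MHz = {per_fpu_mhz:.1f}")
```

Unrounded, the division gives about 57 for both columns; the quoted
table lists 60, which is presumably a rounded figure. When the CPU and
FPU run at different clocks the two ratios diverge, which is one reason
the metric needs a stated definition before machines are compared.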
--
-john mashey	DISCLAIMER:
UUCP:	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS:	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086