Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!think!nike!oliveb!glacier!mips!mash
From: mash@mips.UUCP (John Mashey)
Newsgroups: net.arch
Subject: Re: Floating point performance
Message-ID: <725@mips.UUCP>
Date: Thu, 16-Oct-86 00:03:13 EDT
Article-I.D.: mips.725
Posted: Thu Oct 16 00:03:13 1986
Date-Received: Thu, 16-Oct-86 22:55:43 EDT
References: <340@euroies.UUCP> <1989@videovax.UUCP> <722@mips.UUCP> <8184@sun.uucp>
Reply-To: mash@mips.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 133

In article <8184@sun.uucp> dgh@sun.uucp (David Hough) writes:
>
>                            Mflops Per MHz
>     I'd like to add to John Mashey's recent posting about floating-
>point performance....
Thanks; as I'd said, the parentage of my numbers was suspect, so it's good to
see some that I trust more.
>
>Machine         CPU Mhz FPU MHz  KW   KW/CPUMhz KW/FPUMHz
>
>Sun-3/160+68881 16.7    16.7     955     60      60
Oops, I'd thought you guys used 12.5MHz 68881s at one point (but I checked
the current literature and it says no).  Has it changed recently?

> ....  Whetstone
>benchmark performance is mostly determined by the following code:
> (bunch of code) ...
>
>On Weitek 1164/1165-based systems, execution time for the P3 loop is
>dominated by the division operation...
>something far more interesting: the best way to improve Whetstone per-
>formance is to do enough global inter-procedural optimization in your
>compiler to determine that P3 only needs to be called once.  This
>gives a 2X performance increase with no hardware work at all! One MIPS
>paper suggests that the MIPS compiler does this or something similar.
Actually, that's an optional optimizing phase whose heuristics are
still being tuned: we didn't use it on this, and in fact we don't generally
use it on synthetic benchmarks at all: it's too destructive!
(There's nothing like seeing functions being grabbed in-line, discovering
that they don't do anything, and then just optimizing the whole thing away.
At least Whetstone computes and prints some numbers, so some real
work got done.  Nevertheless, David's comments are appropriate, i.e., we
share the same skepticism of Whetstone, as I'd noted in the original
posting).
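
To make the "optimizing the whole thing away" effect concrete, here is a
minimal sketch in C. It is not Whetstone code and not the MIPS compiler's
actual behavior; tiny_proc, dead_benchmark and live_benchmark are
hypothetical names invented for illustration. The first loop calls a
procedure whose result is never read, so a compiler that inlines the call is
free to delete the loop body. The second loop consumes the result, so some
work has to survive, although a compiler may still notice that the call is
loop-invariant and evaluate it only once.

```c
/* Hypothetical stand-in for a small synthetic-benchmark procedure.
   It has no side effects other than writing *z. */
static void tiny_proc(double x, double y, double *z)
{
    double t = 0.5 * (x + y);
    *z = (t + y) / 2.0;
}

/* The value written to z is never read. Once tiny_proc is inlined, the
   loop body is dead code, and the loop itself can be removed. */
long dead_benchmark(long iterations)
{
    double z = 0.0;
    long i;
    for (i = 0; i < iterations; i++)
        tiny_proc(1.0, 1.0, &z);
    return iterations;
}

/* Here z feeds the returned sum, so the result must still be computed.
   The arguments never change, though, so the call may be hoisted out of
   the loop and executed once. */
double live_benchmark(long iterations)
{
    double z = 0.0, sum = 0.0;
    long i;
    for (i = 0; i < iterations; i++) {
        tiny_proc(1.0, 1.0, &z);
        sum += z;
    }
    return sum;
}
```

Timing dead_benchmark with and without optimization is one way to see how
much of a synthetic benchmark's "work" depends on the compiler not looking
too closely.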

>Maybe benchmark performance should be normalized for software as well
>as hardware technology.
True! Some interesting work along that line was done over at Stanford by
Fred Chow, who did a machine-independent optimizer with multiple back-ends
to be able to compare machines using the same compiler technology.  That's
probably the best way to factor it out.  The other interesting way is to
be able to turn optimizations on/off and see how much difference they make.
>
>     I've discussed benchmarking issues at length in the Floating-
>Point Programmer's Guide for the Sun Workstation, 3.2 Release, leading
Is this out yet? Sounds good.  Previous memos have been useful.
>to the recommendation that the nonlinear optimization and zero-finding
>that P3 is intended to mimic is better benchmarked by the real thing,
>such as the SPICE program.
Yes, although it would be awfully nice to have smaller hunks of it that
could be turned into reasonable-size benchmarks, especially ones that
could be simulated (in advance of CPU design) a little more easily.
>
>     Linear problems are usually characterized by large dimension and
>therefore memory and bus performance is as important as peak
>floating-point performance; a Linpack benchmark with suitably-
>dimensioned arrays is appropriate.
Yes.
>
>     I don't know whether RISC or CISC designs will prove to give the
>most bang for the buck, but I do have some philosophical questions for
>RISC gurus:  Is hardware floating point faster than software floating
>point on RISC systems?  If so, and it is because the FPU technology is
>faster than the CPU, then why isn't the CPU fabricated with that tech-
>nology?  If it's just a matter of obtaining parallelism, then wouldn't
>two identical CPU's work just as well and be more flexible for non-
>floating-point applications? If there are functional units on the FPU
>that aren't on the CPU, should they be on the CPU so non-floating-
>point instructions can use them if desirable? If the CPU and FPU are
>one chip, cycle times should be slower, but would the reduced communi-
>cation overhead compensate?  If you use separate heterogeneous proces-
>sors, don't you end up with ... a CISC?

1) Is hardware FP faster?  Yes.
2) No, technology is the same, at least in our case.  I don't know what
other people do.
3) It's not just parallelism, but dedicating the right kind of hardware.
A 32-bit integer CPU has no particular reason to have the kinds of datapaths
an FPU needs.  There are functional units on the FPU, but they aren't ones
that help the CPU much (or they would have been on the CPU in the first place!)
4) Would reduced communication overhead compensate?  Probably not, at the
current state of technology that is generally available.  Right now, at least
in anything close to 2-micron CMOS, if the FPU is part of the CPU chip, it
just has to be heavily microcoded.  It's only when chips shrink enough
that you can put the fastest FPU together with the CPU on 1 chip, and have
nothing better to put on that chip, that it's worth doing for performance.
(Note: there may be other reasons, or different price/performance aim points
for integrating them, but if you want FP performance, you must dedicate
significant silicon real-estate.)
5) Don't you end up with ... a CISC?  I'm not sure what this means.  RISC
means different things to different people.  What it usually means to us is:
	a) Design approach where hardware resources are concentrated on things
	that are performance-critical and universal.
	b) The belief that in making things fast, instructions and/or
	complex addressing formats drop out, NOT as a GOAL, but as a side-effect.
Thus, in our case, we designed a CPU that would go fast for integer performance,
and have a tightly coupled coprocessor interface that would let FP go fast also.
(Note: integer performance is universal, whereas FP is mostly bimodal:
people either don't care about it at all, or want as much as they can get.)
When you measure integer programs, you make choices to include or delete
features, according to the statistics seen in measuring substantial programs.
You do the same thing for FP-intensive programs.  Guess what! You discover
that FP Adds, Subtracts, Multiplies (and maybe Divides) are:
	a) Good Things
	b) Not simulatable by integer arithmetic very quickly.
However, suppose that we'd discovered that FP Divide happened so seldom
that it could be simulated in software at an adequate performance level,
and that taking that silicon and using it to make FP Mult faster gave better
overall performance.  In that case, we might have done it that way.
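
As an illustration of what "simulated in software" could look like, here is
a minimal sketch in C of division built from multiplies, adds and subtracts,
using Newton-Raphson iteration on the reciprocal. This is a standard
textbook technique, not a description of what MIPS did or considered;
soft_divide is a hypothetical name. It assumes b is finite, nonzero and
normal, ignores overflow and underflow, and does not guarantee a correctly
rounded result.

```c
#include <math.h>

/* Approximate a / b without using the divide operator.
   Assumption: b is finite, nonzero and not subnormal. */
double soft_divide(double a, double b)
{
    int exponent, i;
    int negative = (b < 0.0);
    double m, r;

    if (negative)
        b = -b;

    /* Split b into m * 2^exponent with 0.5 <= m < 1. */
    m = frexp(b, &exponent);

    /* Linear first guess for 1/m: 48/17 - (32/17) * m.
       Its relative error on [0.5, 1) is at most 1/17. */
    r = 2.8235294117647058 - 1.8823529411764706 * m;

    /* Each Newton step r <- r * (2 - m * r) squares the relative error,
       so four steps take 1/17 down below double-precision resolution. */
    for (i = 0; i < 4; i++)
        r = r * (2.0 - m * r);

    /* Undo the scaling: 1/b = (1/m) * 2^(-exponent). */
    r = ldexp(r, -exponent);

    return negative ? -(a * r) : a * r;
}
```

Each iteration costs two multiplies and a subtract, which is the kind of
count one would weigh against the measured frequency of divides when
deciding where the silicon should go.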

In any case, we don't see any conflict in having a RISC with FP
(or decimal, or ... anything where some important class of application needs
hardware thrown at it and can justify the cost of having it).
Seymour Cray has been doing fast machines for years with similar design
principles (if at a different cost point!) and FP has certainly been there.

Anyway, thanks for the additional data.  Also, I'd be happy to see more
discussion on what metrics are reasonable (especially since the original
posting invented "Whetstones/MHz" on the spur of the moment).  There
have been some interesting side discussions generated, on both:
	a) Are KWhets a good choice?
	b) What's a MHz?
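
For reference, the normalization under discussion is just a ratio, and the
"What's a MHz?" question amounts to which clock goes in the denominator. A
minimal sketch in C, with illustrative struct and function names that appear
nowhere in the postings, and assuming "KW" means thousands of Whetstones per
second:

```c
/* One row of a benchmark table; field names are illustrative. */
struct bench_row {
    const char *machine;
    double kwhets;    /* assumed: thousands of Whetstones per second */
    double cpu_mhz;   /* CPU clock rate */
    double fpu_mhz;   /* FPU clock rate, which may differ from the CPU's */
};

/* Normalize by the CPU clock. */
double kw_per_cpu_mhz(const struct bench_row *r)
{
    return r->kwhets / r->cpu_mhz;
}

/* Normalize by the FPU clock. */
double kw_per_fpu_mhz(const struct bench_row *r)
{
    return r->kwhets / r->fpu_mhz;
}
```

Feeding in the Sun-3/160+68881 row quoted above (955 KW at 16.7 MHz for both
clocks) gives a little over 57 for either ratio; the quoted table lists 60,
presumably a rounded figure.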
As can be seen, this business is still clearly in need of benchmarks that:
	a) measure something real.
	b) measure something understandable.
	c) are small enough that they can be run and simulated in reasonable
	time.
	d) predict real performance of adequate-sized classes of programs.
	e) are used by enough people that you can do comparisons.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086