Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!columbia!rutgers!sri-spam!nike!ucbcad!ucbvax!decvax!decwrl!sun!dgh
From: dgh@sun.UUCP
Newsgroups: net.arch
Subject: Re: Floating point performance
Message-ID: <8184@sun.uucp>
Date: Tue, 14-Oct-86 21:16:52 EDT
Article-I.D.: sun.8184
Posted: Tue Oct 14 21:16:52 1986
Date-Received: Wed, 15-Oct-86 20:15:17 EDT
References: <340@euroies.UUCP> <1989@videovax.UUCP> <722@mips.UUCP>
Organization: Sun Microsystems, Inc.
Lines: 85


                            Mflops Per MHz

                             David Hough
			   dhough@sun.com

     I'd like to add to John Mashey's recent posting about floating-
point performance.  In the following table extracted and revised from
that posting, the Sun-3 measurements are mine; the MIPS numbers are
Mashey's.  All KW results indicate thousands of double precision Whet-
stone instructions per second.  Results marked * represent implementa-
tions based on Weitek chips.  As Mashey points out, it's not clear
whether the MHz should refer to the CPU or FPU, so I included both.

Machine         CPU MHz FPU MHz  KW   KW/CPUMHz KW/FPUMHz

Sun-3/160+68881 16.7    16.7     955     60      60
Sun-3/260+68881 25      20      1240     50      60
Sun-3/160+FPA*  16.7    16.7    1840    100     100
Sun-3/260+FPA*  25      16.7    2600    100     160

MIPS R2360*      8       8      1160    140     140     (interim restrictions)
MIPS R2010       8       8      4500    560     560     (simulated)
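
The two ratio columns are just KW divided by the respective clock
rate.  A minimal sketch that recomputes them from the first three
numeric columns follows; the names ROWS, RATIOS and kw_per_mhz are
made up for illustration.  The sketch prints unrounded values, so
expect small differences from the loosely rounded figures in the table.

```python
# (CPU MHz, FPU MHz, KW) for each row of the table above, in order.
ROWS = [
    (16.7, 16.7, 955),   # Sun-3, 68881
    (25.0, 20.0, 1240),  # Sun-3, 68881
    (16.7, 16.7, 1840),  # Sun-3, FPA (Weitek)
    (25.0, 16.7, 2600),  # Sun-3, FPA (Weitek)
    (8.0, 8.0, 1160),    # MIPS R2360 (Weitek)
    (8.0, 8.0, 4500),    # MIPS R2010 (simulated)
]


def kw_per_mhz(cpu_mhz, fpu_mhz, kw):
    """Return (KW per CPU MHz, KW per FPU MHz), unrounded."""
    return kw / cpu_mhz, kw / fpu_mhz


RATIOS = [kw_per_mhz(*row) for row in ROWS]

if __name__ == "__main__":
    for (cpu, fpu, kw), (per_cpu, per_fpu) in zip(ROWS, RATIOS):
        print(f"{cpu:5.1f} {fpu:5.1f} {kw:5d}  {per_cpu:6.1f} {per_fpu:6.1f}")
```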

     As you puzzle over the meaning of these results, remember that
elementary transcendental function routines have minor effect on Whet-
stone performance when the hardware is high-performance.  Whetstone
benchmark performance is mostly determined by the following code:

                DO 90 I=1,N8
                CALL P3(X,Y,Z)
   90           CONTINUE


        SUBROUTINE P3(X,Y,Z)
        IMPLICIT REAL (A-H,O-Z)
        COMMON T,T1,T2,E1(4),J,K,L
        X1 = X
        Y1 = Y
        X1 = T * (X1 + Y1)
        Y1 = T * (X1 + Y1)
        Z = (X1 + Y1) / T2
        RETURN
        END

On Weitek 1164/1165-based systems, execution time for the P3 loop is
dominated by the division operation, which is about 6 times slower
than an addition or multiplication and can't be overlapped with any
other operation, so it inhibits pipelining.  No 1164 operation can
overlap any 1165 operation, either.  Nor can parallel invocation of
P3 calls be justified without doing enough analysis to discover
something far more interesting: the best way to improve Whetstone
performance is to do enough global inter-procedural optimization in
your compiler to determine that P3 only needs to be called once,
since every call gets the same inputs and overwrites the same Z.  This
gives a 2X performance increase with no hardware work at all!  One MIPS
paper suggests that the MIPS compiler does this or something similar.
Maybe benchmark performance should be normalized for software as well
as hardware technology.
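
The argument above can be sketched in a few lines.  This is a Python
transliteration for illustration, not the benchmark source: the names
Common, p3_loop, p3_hoisted, division_share and implied_fraction are
hypothetical, and the cost model counts only the six arithmetic
operations in P3 (three adds, two multiplies, one divide), ignoring
call overhead and memory traffic.  Under that crude model, a divide
costing 6 adds takes 6/11 of the arithmetic time, and a 2X overall gain
from removing the loop implies the loop was about half the run time.

```python
from dataclasses import dataclass


@dataclass
class Common:
    """Stand-in for the two COMMON variables that P3 reads."""
    t: float
    t2: float


def p3(x, y, c):
    """Transliteration of P3: returns Z; X and Y are left unchanged."""
    x1 = x
    y1 = y
    x1 = c.t * (x1 + y1)
    y1 = c.t * (x1 + y1)
    return (x1 + y1) / c.t2


def p3_loop(x, y, c, n8):
    """The benchmark loop as written: n8 identical calls."""
    z = None
    for _ in range(n8):
        z = p3(x, y, c)
    return z


def p3_hoisted(x, y, c, n8):
    """What an inter-procedural optimizer could emit: a single call."""
    return p3(x, y, c) if n8 >= 1 else None


def division_share(div_cost=6.0):
    """Fraction of P3 arithmetic time spent dividing, with add = mul = 1."""
    adds, muls, divs = 3, 2, 1
    return divs * div_cost / (adds + muls + divs * div_cost)


def implied_fraction(speedup):
    """Run-time fraction removed, given the overall speedup it produced."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    c = Common(t=0.499975, t2=2.0)  # illustrative values only
    print(p3_loop(1.0, 1.0, c, 1000) == p3_hoisted(1.0, 1.0, c, 1000))
    print(division_share(), implied_fraction(2.0))
```

Because p3 reads only its arguments and the common block and writes
only Z, the loop result does not depend on the trip count as long as
it is at least one, which is all the compiler has to prove.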

     I've discussed benchmarking issues at length in the Floating-
Point Programmer's Guide for the Sun Workstation, 3.2 Release, leading
to the recommendation that the nonlinear optimization and zero-finding
that P3 is intended to mimic are better benchmarked by the real thing,
such as the SPICE program.  Of course, SPICE is a complicated real
application and its performance is difficult to predict in advance,
and that makes marketing and management scientists everywhere uneasy.

     Linear problems are usually characterized by large dimension, so
memory and bus performance are as important as peak floating-point
performance; a Linpack benchmark with suitably dimensioned arrays is
appropriate.
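
To make the memory-versus-arithmetic point concrete: the inner loop of
Linpack's factorization is the DAXPY operation, y = a*x + y.  A rough
count follows, assuming double precision operands and that nothing
stays in a register or cache between iterations (my assumption for the
sketch, not a measurement).  The function names are hypothetical.

```python
def daxpy(a, x, y):
    """y <- a*x + y, in place, one element at a time."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def daxpy_traffic(n, bytes_per_word=8):
    """Flops and memory bytes moved by a length-n DAXPY."""
    flops = 2 * n                         # one multiply and one add per element
    bytes_moved = 3 * n * bytes_per_word  # load x[i], load y[i], store y[i]
    return flops, bytes_moved


if __name__ == "__main__":
    print(daxpy(2.0, [1.0, 2.0], [10.0, 20.0]))
    print(daxpy_traffic(100))
```

At 24 bytes moved for every 2 floating-point operations, a fast FPU
spends much of its time waiting unless the memory system keeps up.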

     I don't know whether RISC or CISC designs will prove to give the
most bang for the buck, but I do have some philosophical questions for
RISC gurus:  Is hardware floating point faster than software floating
point on RISC systems?  If so, and it is because the FPU technology is
faster than the CPU technology, then why isn't the CPU fabricated with
that technology?  If it's just a matter of obtaining parallelism, then
wouldn't two identical CPUs work just as well and be more flexible for
non-floating-point applications?  If there are functional units on the
FPU that aren't on the CPU, should they be on the CPU so that
non-floating-point instructions can use them if desirable?  If the CPU
and FPU are on one chip, cycle times should be slower, but would the
reduced communication overhead compensate?  If you use separate
heterogeneous processors, don't you end up with ... a CISC?