Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!columbia!rutgers!sri-spam!nike!ucbcad!ucbvax!decvax!decwrl!sun!dgh
From: dgh@sun.UUCP
Newsgroups: net.arch
Subject: Re: Floating point performance
Message-ID: <8184@sun.uucp>
Date: Tue, 14-Oct-86 21:16:52 EDT
Article-I.D.: sun.8184
Posted: Tue Oct 14 21:16:52 1986
Date-Received: Wed, 15-Oct-86 20:15:17 EDT
References: <340@euroies.UUCP> <1989@videovax.UUCP> <722@mips.UUCP>
Organization: Sun Microsystems, Inc.
Lines: 85

                         Mflops Per MHz

                          David Hough
                        dhough@sun.com

I'd like to add to John Mashey's recent posting about floating-point
performance. In the following table, extracted and revised from that
posting, the Sun-3 measurements are mine; the MIPS numbers are Mashey's.
All KW results indicate thousands of double precision Whetstone
instructions per second. Results marked * represent implementations
based on Weitek chips. As Mashey points out, it's not clear whether the
MHz should refer to the CPU or FPU, so I included both.

Machine            CPU MHz  FPU MHz    KW   KW/CPUMHz  KW/FPUMHz

Sun-3/160+68881      16.7     16.7    955       60         60
Sun-3/260+68881      25       20     1240       50         60
Sun-3/160+FPA*       16.7     16.7   1840      110        110
Sun-3/260+FPA*       25       16.7   2600      100        160
MIPS R2360*           8        8     1160      140        140  (interim restrictions)
MIPS R2010            8        8     4500      560        560  (simulated)

As you puzzle over the meaning of these results, remember that
elementary transcendental function routines have only a minor effect on
Whetstone performance when the hardware is high-performance.
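
For anyone who wants to recompute the per-MHz columns, the following is a minimal sketch in Python. It assumes the ratios are simply KW divided by the clock rate and rounded to roughly the nearest ten, which is what the table appears to do. The function name kw_per_mhz and the round_to parameter are made up for illustration; the two sample rows are copied from the table.

```python
def kw_per_mhz(kw, mhz, round_to=10):
    """Return KW divided by clock rate, rounded to a multiple of round_to.

    The rounding granularity is an assumption inferred from the table,
    not something the posting states.
    """
    return int(round(kw / mhz / round_to) * round_to)


# (machine, CPU MHz, FPU MHz, KW), copied from the table above
rows = [
    ("Sun-3/260+FPA", 25.0, 16.7, 2600),
    ("MIPS R2010 (simulated)", 8.0, 8.0, 4500),
]

for machine, cpu_mhz, fpu_mhz, kw in rows:
    per_cpu = kw_per_mhz(kw, cpu_mhz)
    per_fpu = kw_per_mhz(kw, fpu_mhz)
    print(f"{machine:24s} KW/CPUMHz={per_cpu:4d}  KW/FPUMHz={per_fpu:4d}")
```

The Sun-3/260+FPA row shows why the choice of clock matters: the same 2600 KW looks quite different depending on whether it is divided by the 25 MHz CPU clock or the 16.7 MHz FPU clock.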

Whetstone benchmark performance is mostly determined by the following
code:

      DO 90 I=1,N8
      CALL P3(X,Y,Z)
   90 CONTINUE

      SUBROUTINE P3(X,Y,Z)
      IMPLICIT REAL (A-H,O-Z)
      COMMON T,T1,T2,E1(4),J,K,L
      X1 = X
      Y1 = Y
      X1 = T * (X1 + Y1)
      Y1 = T * (X1 + Y1)
      Z = (X1 + Y1) / T2
      RETURN
      END

On Weitek 1164/1165-based systems, execution time for the P3 loop is
dominated by the division operation, which is about 6 times slower than
an addition or multiplication and can't be overlapped with any other
operation, inhibiting pipelining. Furthermore, not only can no 1164
operation overlap any 1165 operation, but parallel invocation of P3
calls can't be justified without doing enough analysis to discover
something far more interesting: the best way to improve Whetstone
performance is to do enough global inter-procedural optimization in
your compiler to determine that P3 only needs to be called once. This
gives a 2X performance increase with no hardware work at all! One MIPS
paper suggests that the MIPS compiler does this or something similar.
Maybe benchmark performance should be normalized for software as well
as hardware technology.

I've discussed benchmarking issues at length in the Floating-Point
Programmer's Guide for the Sun Workstation, 3.2 Release, leading to the
recommendation that the nonlinear optimization and zero-finding that P3
is intended to mimic is better benchmarked by the real thing, such as
the SPICE program. Of course, SPICE is a complicated real application
and its performance is difficult to predict in advance, and that makes
marketing and management scientists everywhere uneasy.

Linear problems are usually characterized by large dimension and
therefore memory and bus performance is as important as peak
floating-point performance; a Linpack benchmark with suitably-
dimensioned arrays is appropriate.
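
To return to the P3 loop for a moment, here is a rough Python transliteration showing why a compiler is entitled to call P3 only once: the routine reads X, Y, T and T2, changes none of them, and so produces the same Z on every trip through the loop. The values chosen for T, T2 and the loop count are assumptions for illustration only, and the unit operation costs (add = multiply = 1, divide = 6) are just the "about 6 times slower" figure quoted above, not measurements.

```python
T, T2 = 0.499975, 2.0  # assumed constants; the posting does not give them


def p3(x, y):
    """Python rendering of SUBROUTINE P3; returns Z instead of storing it."""
    x1 = x
    y1 = y
    x1 = T * (x1 + y1)
    y1 = T * (x1 + y1)
    return (x1 + y1) / T2


def loop_naive(x, y, n8):
    """Call p3 on every iteration, as the DO 90 loop is written."""
    z = None
    for _ in range(n8):
        z = p3(x, y)
    return z


def loop_hoisted(x, y, n8):
    """Call p3 once: its inputs never change, so one call gives the same Z."""
    return p3(x, y) if n8 > 0 else None


# Rough cost model: P3 performs 3 additions, 2 multiplications, 1 division.
ADD_COST, MUL_COST, DIV_COST = 1, 1, 6
p3_cost = 3 * ADD_COST + 2 * MUL_COST + 1 * DIV_COST
div_share = DIV_COST / p3_cost  # fraction of arithmetic time spent dividing

n8 = 899  # hypothetical loop count
assert loop_naive(1.0, 1.0, n8) == loop_hoisted(1.0, 1.0, n8)
print(f"division is {div_share:.0%} of the arithmetic cost in this model")
```

Under this crude model the single division accounts for a little over half of the arithmetic in P3, and the hoisted version does 1/N8 of the work of the naive loop while producing an identical result.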

I don't know whether RISC or CISC designs will prove to give the most
bang for the buck, but I do have some philosophical questions for RISC
gurus:

Is hardware floating point faster than software floating point on RISC
systems? If so, and it is because the FPU technology is faster than the
CPU, then why isn't the CPU fabricated with that technology? If it's
just a matter of obtaining parallelism, then wouldn't two identical
CPU's work just as well and be more flexible for non-floating-point
applications? If there are functional units on the FPU that aren't on
the CPU, should they be on the CPU so non-floating-point instructions
can use them if desirable? If the CPU and FPU are one chip, cycle times
should be slower, but would the reduced communication overhead
compensate? If you use separate heterogeneous processors, don't you end
up with ... a CISC?