Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!uw-beaver!rice!sun-spots-request
From: dgh@sun.com (David Hough)
Newsgroups: comp.sys.sun
Subject: libm in SunOS 4.0
Keywords: Software
Message-ID: <8904130100.AA14938@dgh.sun.com>
Date: 3 May 89 13:17:55 GMT
Sender: usenet@rice.edu
Organization: Sun-Spots
Lines: 211
Approved: Sun-Spots@rice.edu
Original-Date: Wed, 12 Apr 89 18:00:06 PDT
X-Sun-Spots-Digest: Volume 7, Issue 264, message 13 of 13

In recent Sun-Spots, Peter Lamb has complained about libm in SunOS 4.0.
He's raised a number of interesting points.  The following examines the
issues.

We'll repeat his timing experiments in a slightly simpler form.  The
otherwise worthless "savage" benchmark happens to be ideal for the task
at hand, since its inner loop consists almost entirely of elementary
transcendental functions; I added two register declarations:

        /*
         * savage.c -- floating point speed and accuracy test.  C version
         * derived from BASIC version which appeared in Dr. Dobb's Journal,
         * Sep. 1983, pp. 120-122.
         */

        #define ILOOP 100000

        #include <stdio.h>

        extern double tan(), atan(), exp(), log(), sqrt();

        main()
        {
                int i;
                register double a=1, one=1;

                for (i = 1; i <= (ILOOP - 1); i++)
                        a = tan(atan(exp(log(sqrt(a * a))))) + one;
                printf("a-ILOOP = %g\n", a - ILOOP);
                exit(0);  /* Better get in the habit of adding this! */
        }

Here are some compile lines and timing results from a Sun-3/140:

[[ I removed "savage.c" from each compile line to make the table fit in
80 columns.  --wnl ]]

SunOS  Compile line                             a.out    residual      meets
                                                time     a-ILOOP       SVID?
                                                seconds

3.5    cc -O4 -f68881 -lm                          26    -1.34482e-06  no
3.5    cc -O4 -f68881 /usr/lib/f68881.il -lm       19    -1.34482e-06  no
4.0    cc -O4 -f68881 -lm                         153    -1.34482e-06  yes
4.0    cc -O4 -f68881 /usr/lib/f68881/libm.il      17    -1.34482e-06  no
4.0    cc -O4 -f68881 math.S                       19    -1.34482e-06  no
4.0    cc -O4 -f68881 math.il                      13     4.83633e-08  no

math.S and math.il are listed later.  What conclusions does this table
suggest?

* In 3.5->4.0 the fast got faster.

* In 3.5->4.0 the slow got slower.

* In 4.0 it is possible to obtain some SVID (System V Interface
  Definition) compliance even with -f68881.  It doesn't matter for this
  program but it does if you run the SV Validation Suite.

* In 4.0 both functions and inline expansion templates could have been
  faster.

* The last executable listed is smallest, fastest, and most accurate,
  for indeed its inner loop is:

        main+0x16:      fmulx   fp7,fp7
        main+0x1a:      fsqrtx  fp7,fp7
        main+0x1e:      flognx  fp7,fp7
        main+0x22:      fetoxx  fp7,fp7
        main+0x26:      fatanx  fp7,fp7
        main+0x2a:      ftanx   fp7,fp7
        main+0x2e:      faddx   fp6,fp7
        main+0x32:      addql   #1,d7
        main+0x34:      cmpl    #0x1869f,d7
        main+0x3a:      bles    main+0x16

  which could scarcely be improved upon.  This is the main benefit of
  inline expansion of function calls: when they work well, all the
  direct and indirect effects of function calls are eliminated.

Let's examine each of those possible conclusions.

* In SunOS 3.5 the compiler generates some workarounds for A79J
  68881's.  These were removed for 4.0, so most 68881's can run faster.
  That made the inline templates more effective.  Thus the fast got
  faster.  Also the SunOS 4.0 compiler invokes a global optimizer but
  that doesn't affect this program much.

* In SunOS 3.5, if you compiled with -f68881 or -ffpa the libm didn't
  meet the SVID requirements for errno and matherr.  That was fixed in
  4.0, at a significant performance penalty; given that, I figured that
  anybody who cared about floating-point performance in C was going to
  use the inline expansion templates all the time, so I optimized them
  and didn't bother with the corresponding libm functions.  The SVID
  requirements are wrong-headed; X3J11 saw half the light and removed
  matherr without grasping that the arguments they used to remove
  matherr were equally appropriate for errno.  Anyway, if you don't use
  the inline expansion templates in 4.0 you conform to the SVID whether
  you need to or not.  Thus the slow got slower.
  Indeed avoiding the SVID performance penalties is one of the main
  reasons that C programmers would use the inline expansion templates
  in 4.0.

* SunOS 4.0 libm functions would obviously be faster if they ignored
  the SVID.  Here is a corresponding math.S file:

        #define FUNC(F,G) \
                .globl  _/**/F ;\
        _/**/F: movel   sp@+,a0 ; \
                f/**/G/**/d sp@,fp0 ; \
                fmoved  fp0,sp@ ; \
                movel   sp@,d0 ; \
                movel   sp@(4),d1 ; \
                jmp     a0@

        FUNC(sqrt,sqrt)
        FUNC(exp,etox)
        FUNC(log,logn)
        FUNC(tan,tan)
        FUNC(atan,atan)

* What wasn't apparent until Peter Lamb provoked an investigation is
  that the 4.0 inline templates weren't well matched with the
  capabilities of c2, the local optimizer that follows the inline
  expansion.  c2 likes to see sp@+ and sp@- but not sp@; a revised
  math.IL file:

        #define FUNC(F,G) \
                .inline _/**/F,8 ;\
                f/**/G/**/d sp@+,fp0 ; \
                fmoved  fp0,sp@- ; \
                movel   sp@+,d0 ; \
                movel   sp@+,d1 ; \
                .end

        FUNC(sqrt,sqrt)
        FUNC(exp,etox)
        FUNC(log,logn)
        FUNC(tan,tan)
        FUNC(atan,atan)

  which can be converted to a math.il this way

        cpp math.IL | sed 'y/;/\n/'

  since cc doesn't handle .IL files!  Anyway the inline expansion
  templates have been revised correspondingly for SunOS 4.1.

Why Sun-3?

If you have a Sun-3 on your desk, as I do, then naturally you want to
make the most of it.  But when your budget permits you may well want to
upgrade to a Sun-4.  As announced today, the entry price has been
substantially reduced.  Since the SPARC architecture, unlike MC68881,
defines fsqrt but no elementary transcendental function instructions,
the libm performance penalty related to SVID is much reduced.

Why C?

Why program numerical work in C when Fortran is almost always more
efficient?  Examples supporting the latter assertion: sqrt is an
operator in Fortran, a function in C; Fortran pointers (parameters) can
be assumed to be unaliased, but not in C.  The issues Peter Lamb raised
don't exist in Sun Fortran; fsqrt instructions are simply generated
inline as needed without resorting to libm or .il files.
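
To make the aliasing point concrete, here is a minimal C sketch.  The
function name add_first is hypothetical and is not taken from any Sun
library.  Because dst and src may overlap, a C compiler has to assume
that each store to dst[i] might change src[0], so it cannot simply keep
src[0] in a register for the whole loop; a Fortran compiler may assume
that the corresponding dummy arguments do not overlap.

```c
/* Hypothetical illustration only; add_first is not a library routine. */

/* Add src[0] to each of the n elements of dst. */
void add_first(double *dst, const double *src, int n)
{
    int i;

    for (i = 0; i < n; i++)
        dst[i] += src[0];   /* src[0] is reloaded: dst[i] may alias it */
}
```

With distinct arrays, dst = {1, 2, 3} and src[0] = 1 give {2, 3, 4}.
Called on one array as add_first(a, a, 3) with a = {1, 2, 3}, the result
is {2, 4, 5}, because a[0] has already changed by the time the later
elements are updated.
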
Of course creating a complete application by combining numerical
Fortran code with non-numerical C code is not very easy to do in a
machine-independent way; I tried to get X3J11 interested in that
problem, so much more significant than errno, without success.

Why Inline Expansion Templates?

Sun's inline expansion template facility is probably not exactly like
anybody else's, and thus unfamiliar.  The facility was originally
intended to provide a quick fix to some pernicious problems such as
complex arithmetic performance in Fortran prior to implementation of
the definitive solution in the rest of the compiler.  The best way to
think of it is that you can redesign parts of the compiler with inline
expansion templates.  Sun-supplied algorithm too slow or too accurate?
Write your own.

Questions for the Reader

Tell me what you think about the following:

* Should SunOS provide two versions of libm, one that conforms to SVID,
  X3J11, and X/Open requirements and one that doesn't compromise
  performance?

* Should SunOS provide means of EASILY obtaining maximum performance
  without having to read many pages of obscure manuals?  Note that
  bundling additional options into -O or -O4 might NOT be a good idea
  since optimization levels are somewhat independent of other types of
  optimizations such as inline expansion templates.  Embedded systems
  with limited physical memory, for instance, may prefer to call a
  function rather than suffer code expansion.  So the question is
  whether a new bundled compiler option such as "-allopts" would be
  appropriate.

For More Information

Check out the SVID Volume 1 and the X3J11 draft and rationale, and
maybe the MC68881/2 manual.  And (once again) the Floating-Point
Programmer's Guide in your SunOS doc crate and especially the 4.0
addendum in the Programmer's Guides Minibox Read This First.

If you are curious about C's shortcomings in the numerical area, I have
written a much longer memorandum as part of the X3J11 public review; I
will send troff source on request.  If you are even more curious then
contact Rex Jaeschke (uunet.uu.net!aussie!rex) about the Numerical C
Extensions Group.