Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!uw-beaver!rice!sun-spots-request
From: dgh@sun.com (David Hough)
Newsgroups: comp.sys.sun
Subject: libm in SunOS 4.0
Keywords: Software
Message-ID: <8904130100.AA14938@dgh.sun.com>
Date: 3 May 89 13:17:55 GMT
Sender: usenet@rice.edu
Organization: Sun-Spots
Lines: 211
Approved: Sun-Spots@rice.edu
Original-Date: Wed, 12 Apr 89 18:00:06 PDT
X-Sun-Spots-Digest: Volume 7, Issue 264, message 13 of 13

In recent Sun-Spots, Peter Lamb has complained about libm in SunOS 4.0.
He's raised a number of interesting points.  The following examines the
issues.

We'll repeat his timing experiments in a slightly simpler form.  The
otherwise worthless "savage" benchmark happens to be ideal for the task at
hand, since its inner loop consists almost entirely of elementary
transcendental functions; I added two register declarations:

/*
 * savage.c -- floating point speed and accuracy test.  C version derived
 * from BASIC version which appeared in Dr. Dobb's Journal, Sep. 1983, pp.
 * 120-122.
 */

#define ILOOP   100000
#include <stdio.h>

extern double   tan(), atan(), exp(), log(), sqrt();

main()
{
        int             i;
        register double a=1, one=1;

        for (i = 1; i <= (ILOOP - 1); i++)
                a = tan(atan(exp(log(sqrt(a * a))))) + one;
        printf("a-ILOOP = %g\n", a - ILOOP);
        exit(0);        /* Better get in the habit of adding this! */
}

Here are some compile lines and timing results from a Sun-3/140:

[[ I removed "savage.c" from each compile line to make the table fit in 80
columns.  --wnl ]]

SunOS   Compile line                               a.out      residual     meets
                                                   time       a-ILOOP      SVID?
                                                 (seconds)

3.5     cc -O4 -f68881 -lm                          26     -1.34482e-06      no
3.5     cc -O4 -f68881 /usr/lib/f68881.il -lm       19     -1.34482e-06      no
4.0     cc -O4 -f68881 -lm                         153     -1.34482e-06     yes
4.0     cc -O4 -f68881 /usr/lib/f68881/libm.il      17     -1.34482e-06      no
4.0     cc -O4 -f68881 math.S                       19     -1.34482e-06      no
4.0     cc -O4 -f68881 math.il                      13      4.83633e-08      no


     math.S and math.il are listed later.  What conclusions
does this table suggest?

*    In 3.5->4.0 the fast got faster.

*    In 3.5->4.0 the slow got slower.

*    In 4.0 it is possible to obtain some SVID (System V
     Interface Definition) compliance even with -f68881.  It
     doesn't matter for this program but it does if you run
     the SV Validation Suite.

*    In 4.0 both functions and inline expansion templates
     could have been faster.

*    The last executable listed is smallest, fastest, and
     most accurate, for indeed its inner loop is:

      main+0x16:             fmulx   fp7,fp7
      main+0x1a:             fsqrtx  fp7,fp7
      main+0x1e:             flognx  fp7,fp7
      main+0x22:             fetoxx  fp7,fp7
      main+0x26:             fatanx  fp7,fp7
      main+0x2a:             ftanx   fp7,fp7
      main+0x2e:             faddx   fp6,fp7
      main+0x32:             addql   #1,d7
      main+0x34:             cmpl    #0x1869f,d7
      main+0x3a:             bles     main+0x16

     which could scarcely be improved upon.  This is the main benefit of
     inline expansion of function calls: when they work well, all the
     direct and indirect effects of function calls are eliminated.

     Let's examine each of those possible conclusions.

*    In SunOS 3.5 the compiler generates some workarounds for A79J
     68881's.  These were removed for 4.0, so most 68881's can run faster.
     That made the inline templates more effective. Thus the fast got
     faster.  Also the SunOS 4.0 compiler invokes a global optimizer but
     that doesn't affect this program much.

*    In SunOS 3.5, if you compiled with -f68881 or -ffpa the libm didn't
     meet the SVID requirements for errno and matherr.  That was fixed in
     4.0, at a significant performance penalty; given that, I figured
     that anybody who cared about floating-point performance in C was
     going to use the inline expansion templates all the time, so I
     optimized them and didn't bother with the corresponding libm
     functions.   The SVID requirements are wrong-headed; X3J11 saw half
     the light and removed matherr without grasping that the arguments
     they used to remove matherr were equally appropriate for errno.
     Anyway, if you don't use the inline expansion templates in 4.0 you
     conform to the SVID whether you need to or not.  Thus the slow got
     slower.  Indeed avoiding the SVID performance penalties is one of the
     main reasons that C programmers would use the inline expansion
     templates in 4.0.

*    SunOS 4.0 libm functions would obviously be faster if they ignored
     the SVID.  Here is a corresponding math.S file:

     #define FUNC(F,G) \
          .globl  	_/**/F ;\
     _/**/F: movel	sp@+,a0 ; \
          f/**/G/**/d	sp@,fp0 ; \
          fmoved	fp0,sp@ ; \
          movel		sp@,d0 ; \
          movel		sp@(4),d1 ; \
          jmp		a0@

          FUNC(sqrt,sqrt)
          FUNC(exp,etox)
          FUNC(log,logn)
          FUNC(tan,tan)
          FUNC(atan,atan)


*    What wasn't apparent until Peter Lamb provoked an investigation is
     that the 4.0 inline templates weren't well matched with the
     capabilities of c2, the local optimizer that follows the inline
     expansion.  c2 likes to see sp@+ and sp@- but not sp@.  Here is a
     revised math.IL file:

     #define FUNC(F,G) \
          .inline 	_/**/F,8 ;\
          f/**/G/**/d 	sp@+,fp0 ; \
          fmoved	fp0,sp@- ; \
          movel		sp@+,d0 ; \
          movel		sp@+,d1 ; \
          .end

          FUNC(sqrt,sqrt)
          FUNC(exp,etox)
          FUNC(log,logn)
          FUNC(tan,tan)
          FUNC(atan,atan)

     which can be converted to a math.il this way:
          cpp math.IL | sed 'y/;/\n/'
     since cc doesn't handle .IL files!  Anyway the inline
     expansion templates have been revised correspondingly
     for SunOS 4.1.

Why Sun-3?

If you have a Sun-3 on your desk, as I do, then naturally you want to
make the most of it.  But when your budget permits you may well want to
upgrade to a Sun-4.  As announced today, the entry price has been
substantially reduced.  Since the SPARC architecture, unlike the MC68881,
defines fsqrt but no elementary transcendental function instructions, the
libm performance penalty related to SVID is much reduced.

Why C?

Why program numerical work in C when Fortran is almost always more
efficient?  Examples supporting the latter assertion: sqrt is an operator
in Fortran, a function in C; Fortran pointers (parameters) can be assumed
to be unaliased, but not in C.  The issues Peter Lamb raised don't exist
in Sun Fortran; fsqrt instructions are simply generated inline as needed
without resorting to libm or .il files.

Of course creating a complete application by combining numerical Fortran
code with non-numerical C code is not very easy to do in a
machine-independent way; I tried to get X3J11 interested in that problem,
so much more significant than errno, without success.

Why Inline Expansion Templates?

Sun's inline expansion template facility is probably not exactly like
anybody else's, and thus unfamiliar. The facility was originally intended
to provide a quick fix to some pernicious problems such as complex
arithmetic performance in Fortran prior to implementation of the
definitive solution in the rest of the compiler.  The best way to think of
it is that you can redesign parts of the compiler with inline expansion
templates.  Sun-supplied algorithm too slow or too accurate?  Write your
own.

Questions for the Reader

     Tell me what you think about the following:

*    Should SunOS provide two versions of libm, one that conforms to SVID,
     X3J11, and X/Open requirements and one that doesn't compromise
     performance?

*    Should SunOS provide means of EASILY obtaining maximum performance
     without having to read many pages of obscure manuals?  Note that
     bundling additional options into -O or -O4 might NOT be a good idea
     since optimization levels are somewhat independent of other types
     of optimizations such as inline expansion templates.  Embedded
     systems with limited physical memory, for instance, may prefer to
     call a function rather than suffer code expansion.  So the question
     is whether a new bundled compiler option such as "-allopts" would be
     appropriate.

For More Information

Check out the SVID Volume 1 and the X3J11 draft and rationale, and maybe
the MC68881/2 manual.  And (once again) the Floating-Point Programmer's
Guide in your SunOS doc crate and especially the 4.0 addendum in the
Programmer's Guides Minibox Read This First.  If you are curious about C's
shortcomings in the numerical area, I have written a much longer
memorandum as part of the X3J11 public review; I will send troff source on
request.  If you are even more curious then contact Rex Jaeschke
(uunet.uu.net!aussie!rex) about the Numerical C Extensions Group.