Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!usc!zaphod.mps.ohio-state.edu!rpi!crdgw1!crdos1!davidsen
From: davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr)
Newsgroups: comp.arch
Subject: Re: Divide in 1 cycle
Message-ID: <3258@crdos1.crd.ge.COM>
Date: 13 Mar 91 15:13:45 GMT
References: <1991Mar7.043931.13552@bingvaxu.cc.binghamton.edu> <777@spim.mips.COM> <1991Mar8.110801.20042@bingvaxu.cc.binghamton.edu> <1991Mar12.043839.11068@tera.com>
Reply-To: davidsen@crdos1.crd.ge.com (bill davidsen)
Organization: GE Corp R&D Center, Schenectady NY
Lines: 55

In article <1991Mar12.043839.11068@tera.com> bob@tera.com (Bob Alverson) writes:

| For the unlucky whose divisors aren't known to the compiler and aren't loop
| invariant, the divide rate drops to one result every nine ticks.  The only
| significant hardware dedicated to divide is a 256 entry lookup table and
| an 8x8 -> 16 multiplier for the initial reciprocal approximation.

  I got these results from a 386-25:

#   System id: Dell 325, 4MB, 150MB, Xenix/386 2.3.3, 387
#   
#   Math operations, effective instructions/sec (thousands)
#   
#                  Add     Sub     Mpy     Div    Wtd HM
#   short:      7451.0  7378.6  3023.3  2933.3    4656.6
#   long:       7600.0  7368.4  2692.3  2000.0    4031.5
#   float:      1168.8  1168.8   975.6   933.3    1074.9
#   double:     1025.6  1012.7   750.0   789.5     899.2
#
#   Test and branch timing:
#   integer compare and branch    0.688 uSec,   1453.5K/sec
#     float compare and branch    4.320 uSec,    231.5K/sec

At 25 MHz, the divide rates work out to about 8.5 ticks per 16 bit
divide (25000K / 2933.3K) and 12.5 per 32 bit divide (25000K / 2000.0K).
The 486-25 looks like this:

#   System id: HP 486-25, SCO ODT 1.0, 10MB, 300MB, cc
#   
#   Math operations, effective instructions/sec (thousands)
#   
#                  Add     Sub     Mpy     Div    Wtd HM
#   short:     17934.1 18000.0  3483.9  2936.2    6400.4
#   long:      20400.0 19695.6  3225.8  2042.6    5528.7
#   float:      4258.1  4252.0  3829.8  1515.8    3276.1
#   double:     4129.0  4087.9  3260.9  1345.8    2992.5
#   
#   Test and branch timing:
#   integer compare and branch    0.247 uSec,   4054.0K/sec
#     float compare and branch    0.850 uSec,   1176.5K/sec

  This would indicate that the 486 got essentially no help on integer
divide (floating divide improved about 1.6x to 1.7x), but add and
subtract, as well as compare and branch, got a big boost. My overall
results for a bunch of programs were that the 486 was about 2.6x faster
than the 386 at the same clock speed.
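
  As a rough cross-check of the tick counts and ratios above, here is a
small sketch in Python (not part of the benchmark suite, which is in C).
It assumes both machines run at exactly 25 MHz and that one "effective
instruction" is one operation including whatever overhead the C code
adds; the names CLOCK_HZ, I386, I486, ticks_per_op and speedup are made
up for illustration.

```python
# Rough cross-check of the two tables in this post.
# Assumption: both CPUs are clocked at exactly 25 MHz.
# All names here are illustrative, not part of the benchmark suite.
CLOCK_HZ = 25_000_000

OPS = ("add", "sub", "mpy", "div")

# Rates in thousands of effective operations per second, copied from the post.
I386 = {
    "short":  (7451.0, 7378.6, 3023.3, 2933.3),
    "long":   (7600.0, 7368.4, 2692.3, 2000.0),
    "float":  (1168.8, 1168.8, 975.6, 933.3),
    "double": (1025.6, 1012.7, 750.0, 789.5),
}
I486 = {
    "short":  (17934.1, 18000.0, 3483.9, 2936.2),
    "long":   (20400.0, 19695.6, 3225.8, 2042.6),
    "float":  (4258.1, 4252.0, 3829.8, 1515.8),
    "double": (4129.0, 4087.9, 3260.9, 1345.8),
}


def ticks_per_op(kops_per_sec, clock_hz=CLOCK_HZ):
    """Clock ticks per effective operation, for a rate given in K ops/sec."""
    return clock_hz / (kops_per_sec * 1000.0)


def speedup(new_rate, old_rate):
    """Ratio of two rates measured in the same units."""
    return new_rate / old_rate


if __name__ == "__main__":
    for dtype in I386:
        for i, op in enumerate(OPS):
            old, new = I386[dtype][i], I486[dtype][i]
            print(f"{dtype:7s} {op}: "
                  f"386 {ticks_per_op(old):6.2f} ticks, "
                  f"486 {ticks_per_op(new):6.2f} ticks, "
                  f"ratio {speedup(new, old):4.2f}x")
```

  For the long row this gives 12.5 ticks per divide on the 386 and
roughly 12.2 on the 486, while add drops from about 3.3 ticks to about
1.2. These are ticks per effective operation as seen from C, not
instruction timings from the data book.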

  Note: these are ballpark figures, measured from C rather than
assembler. While they are proportional to actual hardware performance,
they are not best case performance. On the other hand, I started
building this benchmark suite in 1970... it measures individual aspects
of performance, looking for those "jackpot cases" where performance is
really bad.
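
  The weights behind the "Wtd HM" column are not listed in this post. As
a minimal sketch of what a weighted harmonic mean computes, the Python
below uses made-up equal weights (an assumption, not the suite's real
weights). With equal weights the 386 short row comes out near 4249
rather than the 4656.6 in the table, which suggests the real weights
favor the faster operations.

```python
def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: sum(w) / sum(w / v).

    Slow rates dominate the result, which is why it suits a summary
    of operations-per-second figures.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


# 386 short row from the post: add, sub, mpy, div (K ops/sec).
SHORT_386 = [7451.0, 7378.6, 3023.3, 2933.3]

# Hypothetical equal weights; the real weights are not given in the post.
EQUAL_WEIGHTS = [1.0, 1.0, 1.0, 1.0]

if __name__ == "__main__":
    hm = weighted_harmonic_mean(SHORT_386, EQUAL_WEIGHTS)
    print(f"equal-weight harmonic mean: {hm:.1f} K ops/sec")
```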
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"