Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!usc!zaphod.mps.ohio-state.edu!rpi!crdgw1!crdos1!davidsen
From: davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr)
Newsgroups: comp.arch
Subject: Re: Divide in 1 cycle
Message-ID: <3258@crdos1.crd.ge.COM>
Date: 13 Mar 91 15:13:45 GMT
References: <1991Mar7.043931.13552@bingvaxu.cc.binghamton.edu> <777@spim.mips.COM> <1991Mar8.110801.20042@bingvaxu.cc.binghamton.edu> <1991Mar12.043839.11068@tera.com>
Reply-To: davidsen@crdos1.crd.ge.com (bill davidsen)
Organization: GE Corp R&D Center, Schenectady NY
Lines: 55

In article <1991Mar12.043839.11068@tera.com> bob@tera.com (Bob Alverson) writes:

| For the unlucky whose divisors aren't known to the compiler and aren't loop
| invariant, the divide rate drops to one result every nine ticks.  The only
| significant hardware dedicated to divide is a 256 entry lookup table and
| an 8x8 -> 16 multiplier for the initial reciprocal approximation.

  I got these results from a 386-25:

# System id: Dell 325, 4MB, 150MB, Xenix/386 2.3.3, 387
#
# Math operations, effective instructions/sec (thousands)
#
#              Add      Sub      Mpy      Div   Wtd HM
# short:    7451.0   7378.6   3023.3   2933.3   4656.6
# long:     7600.0   7368.4   2692.3   2000.0   4031.5
# float:    1168.8   1168.8    975.6    933.3   1074.9
# double:   1025.6   1012.7    750.0    789.5    899.2
#
# Test and branch timing:
#   integer compare and branch   0.688 uSec,  1453.5K/sec
#   float compare and branch     4.320 uSec,   231.5K/sec

  The divide speed would indicate about 9 ticks for 16 bit, about 12.5
for 32 bit.

  The 486-25 looks like this:

# System id: HP 486-25, SCO ODT 1.0, 10MB, 300MB, cc
#
# Math operations, effective instructions/sec (thousands)
#
#              Add      Sub      Mpy      Div   Wtd HM
# short:   17934.1  18000.0   3483.9   2936.2   6400.4
# long:    20400.0  19695.6   3225.8   2042.6   5528.7
# float:    4258.1   4252.0   3829.8   1515.8   3276.1
# double:   4129.0   4087.9   3260.9   1345.8   2992.5
#
# Test and branch timing:
#   integer compare and branch   0.247 uSec,  4054.0K/sec
#   float compare and branch     0.850 uSec,  1176.5K/sec

  This would indicate that the 486 didn't get much help on integer divide,
but add and subtract, as well as compare and branch, got a big boost. My
overall result for a bunch of programs was that the 486 was about 2.6x
faster than the 386 at the same clock speed.

  Note: these figures are presented as ballpark figures, and represent
measured performance obtained using C rather than assembler. While they
are proportional to actual hardware performance, they are not best case
performance.

  On the other hand I started building this benchmark suite in 1970... it
measures individual aspects of performance, looking for those "jackpot
cases" where performance is really bad.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
  "Most of the VAX instructions are in microcode,
   but halt and no-op are in hardware for efficiency"