Path: utzoo!utgpu!watserv1!watmath!att!att!linac!pacific.mps.ohio-state.edu!zaphod.mps.ohio-state.edu!usc!cs.utexas.edu!rice!uupsi!sunic!dkuug!freja.diku.dk!njk From: njk@diku.dk (Niels J|rgen Kruse) Newsgroups: comp.arch Subject: Re: Compilers taking advantage of architectural enhancements Message-ID: <1990Nov7.011427.8054@diku.dk> Date: 7 Nov 90 01:14:27 GMT References: <1990Oct9> <3300194@m.cs.uiuc.edu> <11922@ganymede.inmos.co.uk> <1990Oct31.203932.26325@cs.cmu.edu> <2709@l.cc.purdue.edu> Organization: Department Of Computer Science, University Of Copenhagen Lines: 83 cik@l.cc.purdue.edu (Herman Rubin) writes: >Now for some instructions. There are algorithms, doing all their computations >on a few bits (usually < 10) which use these instructions. As these are >algorithms for generating non-uniform random variables, they are likely to >be used thoudands or millions of times. > Find the distance to the next 1 in a bit stream. Place the stream > pointer past that 1. It is necessary to take into account that the > stream could be exhausted before a 1 is found. It seems that you are generating multiple precision variables. Does that make a large difference? When generating 31 bit exponentially distributed random numbers at least, i don't see much need for this wonder instruction. Some data. I have measured the cost of the various generators of my homegrown random number generation package on a SS1: Cost / random number = 1.59 uSec, Val0 = 0xe29e9cba Cost / Exp. dist. random number = 6.77 uSec, Val1 = 0xe97afa5 Cost / random float = 3.78 uSec, Val2 = -67.4 Cost / exp. dist. random float = 10.1 uSec, Val3 = 122.6 Cost / random double = 6.64 uSec, Val4 = -49.4384703630 Cost / random remainder by 7 = 8.73 uSec, Val5 = 0x0 Cost / random remainder by 7777 = 5.58 uSec, Val6 = 0x35f Cost / random remainder by 77777 = 12.1 uSec, Val7 = 0x18e6f Cost / random remainder by 77777777 = 7.96 uSec, Val8 = 0x37d8808 Cost / random remainder by 777777777 = 7.83 uSec, Val9 = 0x2d2d4019 Timing error is on the order of 5%. Each time is measured over a loop unrolled 20 times, which generate a total of 1000000 numbers. The hexadecimal values printed, are the xor of the numbers produced. The fp values printed, are the sum of the numbers produced, with 20 * the mean subtracted for every 20 numbers. These values are printed to avoid overeager optimization and so that it can be checked that they remain unchanged, when running the code on different architectures (gives a warm fussy feeling). It shouldn't affect timing very much. The double generator is a macro, which take 2 equidistributed 32 bit numbers out of a buffer, mask off the high bit, convert to double, scale and add. Simple no? The generator of 26 exponentially distributed 31 bit numbers with mean 2 is a macro which load a number with the appropriate distribution from a buffer with 10 numbers. The refill algorithm use random minimization, which your proposed instruction is aimed at, and consume an average of 2 equidistributed 32 bit numbers per output number, just like the double generator. The algorithm is quite a lot more complex, however. Yet, it is only slightly slower. I think integer <-> fp conversion or fp scaling by powers of 2 is *much* more in need of speeding up on this machine. The random remainder generator is a macro which test how wide the divisor is and select a routine to call accordingly. If the divisor fit in 13 bits, integer division is used after truncation to 15 bits, else if it will fit in 29 bits, long division is used after truncation to 31 bits, otherwise unsigned long division is used. Using gcc (like here), the 13 and 29 bit routines are inlined. This split was chosen to speed up division on 16 bit machines and machines that don't support unsigned division efficiently. (It seems to help the sparc a lot too, which i hadn't expected). With a constant divisor (like here), the testing and branching optimize away and only the division remains. (There is a cheap test after the division to guarantee exact equidistribution, but this doesn't cost much). Seeing that getting random remainders is more expensive than getting exponentially distributed numbers in most cases, i think that speeding up division should have a higher priority. Conclusion: i think that random minimization is fast enough as it is. BTW, has anybody else noticed the bug in the algorithm for random minimization (algorithm S), p. 128 in The Art of Computer Programming, Vol 2, ed. 2? The bit count should start from 0, not 1, otherwise the generated numbers will have a mean that is too large by log 2 times the desired mean. -- Niels J|rgen Kruse DIKU Graduate njk@diku.dk