Path: utzoo!utgpu!watserv1!watmath!att!att!linac!pacific.mps.ohio-state.edu!zaphod.mps.ohio-state.edu!usc!cs.utexas.edu!rice!uupsi!sunic!dkuug!freja.diku.dk!njk
From: njk@diku.dk (Niels J|rgen Kruse)
Newsgroups: comp.arch
Subject: Re: Compilers taking advantage of architectural enhancements
Message-ID: <1990Nov7.011427.8054@diku.dk>
Date: 7 Nov 90 01:14:27 GMT
References: <1990Oct9> <3300194@m.cs.uiuc.edu> <11922@ganymede.inmos.co.uk> <1990Oct31.203932.26325@cs.cmu.edu> <2709@l.cc.purdue.edu>
Organization: Department Of Computer Science, University Of Copenhagen
Lines: 83

cik@l.cc.purdue.edu (Herman Rubin) writes:
>Now for some instructions.  There are algorithms, doing all their computations
>on a few bits (usually < 10) which use these instructions.  As these are
>algorithms for generating non-uniform random variables, they are likely to
>be used thoudands or millions of times.
>       Find the distance to the next 1 in a bit stream.  Place the stream
>       pointer past that 1.  It is necessary to take into account that the
>       stream could be exhausted before a 1 is found.

It seems that you are generating multiple precision variables.
Does that make a large difference?  When generating 31 bit
exponentially distributed random numbers at least, i don't
see much need for this wonder instruction.

Some data.  I have measured the cost of the various generators
of my homegrown random number generation package on a SS1:

Cost / random number =                    1.59 uSec, Val0 = 0xe29e9cba
Cost / Exp. dist. random number =         6.77 uSec, Val1 = 0xe97afa5
Cost / random float =                     3.78 uSec, Val2 = -67.4
Cost / exp. dist. random float =          10.1 uSec, Val3 = 122.6
Cost / random double =                    6.64 uSec, Val4 = -49.4384703630
Cost / random remainder by         7 =    8.73 uSec, Val5 = 0x0
Cost / random remainder by      7777 =    5.58 uSec, Val6 = 0x35f
Cost / random remainder by     77777 =    12.1 uSec, Val7 = 0x18e6f
Cost / random remainder by  77777777 =    7.96 uSec, Val8 = 0x37d8808
Cost / random remainder by 777777777 =    7.83 uSec, Val9 = 0x2d2d4019

Timing error is on the order of 5%.
Each time is measured over a loop unrolled 20 times, which
generate a total of 1000000 numbers.  The hexadecimal values
printed, are the xor of the numbers produced.  The fp values
printed, are the sum of the numbers produced, with 20 * the
mean subtracted for every 20 numbers.  These values are printed
to avoid overeager optimization and so that it can be checked
that they remain unchanged, when running the code on different
architectures (gives a warm fussy feeling).
It shouldn't affect timing very much.

The double generator is a macro, which take 2 equidistributed
32 bit numbers out of a buffer, mask off the high bit, convert
to double, scale and add.  Simple no?  The generator of
                                                    26
exponentially distributed 31 bit numbers with mean 2   is a
macro which load a number with the appropriate distribution from
a buffer with 10 numbers.  The refill algorithm use random
minimization, which your proposed instruction is aimed at, and
consume an average of 2 equidistributed 32 bit numbers per
output number, just like the double generator.  The algorithm
is quite a lot more complex, however.  Yet, it is only slightly
slower.

I think integer <-> fp conversion or fp scaling by powers of 2
is *much* more in need of speeding up on this machine.

The random remainder generator is a macro which test how wide
the divisor is and select a routine to call accordingly.  If
the divisor fit in 13 bits, integer division is used after
truncation to 15 bits, else if it will fit in 29 bits, long
division is used after truncation to 31 bits, otherwise
unsigned long division is used.  Using gcc (like here), the
13 and 29 bit routines are inlined.

This split was chosen to speed up division on 16 bit machines
and machines that don't support unsigned division efficiently.
(It seems to help the sparc a lot too, which i hadn't expected).
With a constant divisor (like here), the testing and branching
optimize away and only the division remains. (There is a cheap
test after the division to guarantee exact equidistribution,
but this doesn't cost much).

Seeing that getting random remainders is more expensive than
getting exponentially distributed numbers in most cases, i
think that speeding up division should have a higher priority.

Conclusion: i think that random minimization is fast enough as
it is.

BTW, has anybody else noticed the bug in the algorithm for
random minimization (algorithm S), p. 128 in The Art of
Computer Programming, Vol 2, ed. 2?  The bit count should start
from 0, not 1, otherwise the generated numbers will have a mean
that is too large by log 2 times the desired mean.
-- 
Niels J|rgen Kruse 	DIKU Graduate 	njk@diku.dk