Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!know!zaphod.mps.ohio-state.edu!julius.cs.uiuc.edu!apple!agate!linus!linus!sdimax2!jrv From: jrv@sdimax2.mitre.org (VanZandt) Newsgroups: comp.lang.c Subject: Re: count of bits set in a long Message-ID: <121816@linus.mitre.org> Date: 1 Oct 90 19:28:28 GMT References: <37545@ut-emx.uucp> <3820@goanna.cs.rmit.oz.au> <34281@cup.portal.com> <2859@litchi.bbn.com> <51467@brunix.UUCP> <121785@linus.mitre.org> Sender: usenet@linus.mitre.org Organization: The MITRE Corp., Bedford, MA Lines: 70 I think I have finally figured out how the first part of that bit count routine is working. I describe the algorithm as follows: 1. Establish a count in each 3 bit field of the number of bits set in the corresponding field of the input number. 2. Successively "double up": combine pairs of fields while summing their contents. 3. Sum up the contents of all the fields using a modulo operation. We have to start with a 3 bit field, because a 3 bit counter can count the bits in a 6 bit field, but a 2 bit counter can't count the bits in a 4 bit field. We have to "double up" at least once, because a 3 bit counter can't count the bits in a 32 bit word, while a 6 bit counter can. NOTE: with a 64 bit word, we would have to "double up" at least TWICE. We have two implementation decisions - how many times to "double up", and whether to do the modulo operation directly or with a loop as proposed by tac@cs.brown.edu. The loop is favored if '%' is expensive, as suggested by chris@mimsy.umd.edu, or if high order bits are usually not set. If the loop is used, it could pay to double up an extra time so fewer iterations are needed. (Finding x%077 requires log to the base 63 of x, or ln(x)/ln(63), iterations. x%07777 requires ln(x)/ln(07777) iterations, or about half as many). If we double up four times, no modulo operation is needed at all. For example... bitcount4(unsigned long mask) { unsigned long y; y = (mask >> 1) &033333333333L; y = mask - y - ((y >>1) & 033333333333L); /* each 3 bit field now contains the count of bits set in the corresponding field of 'mask' */ y = ((y >> 3) + y) & 030707070707L; /* double up */ /* each 6 bit field now contains the count of bits set in the corresponding field of 'mask' */ y = ((y >> 6) + y) & 007700770077L; /* double up 2nd time */ /* each 12 bit field now contains the count of bits set in the corresponding field of 'mask' */ return y % 07777; } I coded up six variations in all using Turbo C: bitcount3: naive routine using (y>>1) 32 times bitcount1: doubling up once, finding y%077 using loop in separate function bitcount0: doubling up once, finding y%077 using loop bitcount5: doubling up 4 times, no modulo operation needed bitcount4: doubling up twice, finding y%07777 directly bitcount2: doubling up once, finding y%077 directly The relative execution times on a '386, as recorded by the Turbo Profiler, are: _bitcount3 2.14 sec 22% |************************** _bitcount1+_mod63 0.80 sec 7% |******** _bitcount0 0.64 sec 6% |******* _bitcount5 0.38 sec 3% |**** _bitcount4 0.35 sec 3% |**** _bitcount2 0.15 sec 1% |* For this architecture, neither extra doublings nor use of the loop appear helpful. I seem to remember that an 8088 needs many more cycles for an integer division, suggesting that the loop might help there. - Jim Van Zandt