Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!snorkelwacker.mit.edu!stanford.edu!rutgers!cmcl2!kramden.acf.nyu.edu!brnstnd From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein) Newsgroups: comp.lang.c Subject: Re: hash function Message-ID: <17918.Jun2608.17.5091@kramden.acf.nyu.edu> Date: 26 Jun 91 08:17:50 GMT References: <14583@dog.ee.lbl.gov> <29997.Jun2503.57.3491@kramden.acf.nyu.edu> <15047@ulysses.att.com> Organization: IR Lines: 22 In article <15047@ulysses.att.com> kpv@ulysses.att.com (Phong Vo[drew]) writes: > For everyone's entertainment, I did a simple (and admittedly contrived) test > of hashing 1000000 numbers starting at 1000000 into 1000000 buckets. > This is done by treating each number which is stored in a word as a > string of 4 bytes. [ results supposedly showing 261 much better than 33 ] Sorry, but the mod has to be a power of 2. Also, your test is remarkably contrived: we're talking about string hash functions, and typical strings (a) are concentrated on a few character values, not spread evenly over the entire character set; (b) do not all have the same high byte; (c) do not all have the same length. Surely you noticed that a multiplier of 256 will produce perfect hashing in this case. Does that mean you're recommending a multiplier of 256? The profiling statistics I've saved from various compressors form a much more realistic sample. They show 33 as slightly better than random hashing on typical files, 37 as slightly worse. I've never tried 261 or 31, but I'll bet 261 does worse than 33 on text files. ---Dan