Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!snorkelwacker.mit.edu!stanford.edu!rutgers!cmcl2!kramden.acf.nyu.edu!brnstnd
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Newsgroups: comp.lang.c
Subject: Re: hash function
Message-ID: <17918.Jun2608.17.5091@kramden.acf.nyu.edu>
Date: 26 Jun 91 08:17:50 GMT
References: <14583@dog.ee.lbl.gov> <29997.Jun2503.57.3491@kramden.acf.nyu.edu> <15047@ulysses.att.com>
Organization: IR
Lines: 22

In article <15047@ulysses.att.com> kpv@ulysses.att.com (Phong Vo[drew]) writes:
> For everyone's entertainment, I did a simple (and admittedly contrived) test
> of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
> This is done by treating each number which is stored in a word as a
> string of 4 bytes.
  [ results supposedly showing 261 much better than 33 ]

Sorry, but the mod has to be a power of 2. Also, your test is remarkably
contrived: we're talking about string hash functions, and typical
strings (a) are concentrated on a few character values, not spread
evenly over the entire character set; (b) do not all have the same high
byte; (c) do not all have the same length.

Surely you noticed that a multiplier of 256 will produce perfect hashing
in this case. Does that mean you're recommending a multiplier of 256?

The profiling statistics I've saved from various compressors form a much
more realistic sample. They show 33 as slightly better than random
hashing on typical files, 37 as slightly worse. I've never tried 261 or
31, but I'll bet 261 does worse than 33 on text files.

---Dan