Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!cs.utexas.edu!tut.cis.ohio-state.edu!ucbvax!agate!darkstar!ICS.UCI.EDU
From: rfg@ICS.UCI.EDU (Ronald Guilmette)
Newsgroups: comp.os.research
Subject: Re: Cache theory question (UNIX)
Message-ID: <1478@darkstar.ucsc.edu>
Date: 25 Feb 90 21:05:17 GMT
Sender: usenet@darkstar.ucsc.edu
Organization: UC Irvine Department of ICS
Lines: 87
Approved: comp-os-research@jupiter.ucsc.edu

In article <1456@darkstar.ucsc.edu> cjp%megatek.UUCP@ucsd.edu (Chris Pikus) writes:
>
>	The Unix disk buffer cashe is organized as
>a combination of both a hash table and a linked list.
>that is there several linked lists and which one to use
>is determined by a hashing function.
>
>	To wit, when one wishes to access/add a particular block
>in the buffer cache, the kernel performs a hashing function
>on BLKNO (usually a mod( 17) i beleive) and then walks down
>the that list to find said BLKNO.
>
>	My question is thus: WHY mod 17???
>
>	I know the answer is fairly obvious, since 17 is
>a prime number, there would more likely to be an even
>distribution amongst the lists. If something like a filesystem
>started to have certain things regularly spaced, they might have
>a tendency to "resonate" adn fill up one list if one used a number
>like 16.
>
>	To be more precise, is there any reason they picked 17, and
>if so, is there any supporting documentation/articles/mathematical_treatises?
>Could they have picked any prime (like 29, or 53, or 2011)?
>
>	I know this is a pretty esoteric question but I was
>curious and I know there is a lot of UNIX historians on the net.
>
>	Thank you very much,
>	Christopher J. Pikus

I am curious about this very same question.

Recently, I was writting a tool which reads assembly code for a particular
machine and does some mangling on it.  While doing this, I was hunting for
a nice hash function for the valid mnemonics for that machine.  I quickly
found that hash functions which involved prime numbers gave much better
results (i.e. in terms of minimizing collisions) than did ones without a
prime involved.  A quick check of Knuth's books confirmed this to be a well
know fact.

Now the only question remaining for me was "Which prime should I use for
this particular set of mnemonics?"  I wrote a short program which would try
all of the primes up to 65536 with the following function:


	unsigned int hash_mask = /* some power of two */ ;

	static unsigned int
	hash (const char *string, unsigned int prime)
	{
	  unsigned int ret_val = 0;

	  for (const char *p = string; *p; p++)
	    ret_val = (*p ^ ret_val) * prime;
	  return ret_val & hash_mask;
	}

I had the program evaluate the "efficiency" of all of the various primes for
the given mnemonic set.

For the mnemonic set I was interested in (consisting of 165 strings) my
program found a prime that (when used with a primary table size of 2048)
yielded a perfect hash function.  The prime was 5381.

Curiously however, the prime number 17 also yielded good efficiency (i.e.
a small percentage of collisions) with several different primary table
sizes (with my specific input key set).  Other numbers that seemed good
with various table sizes were 19 and 29, but 17 was always better.

Now in my case, I had a fixed input key set that was not going to be changed,
so I went with 5381 as my prime, but it occured to me that a different prime
would probably be best for my hash function which hashed *identifiers*.

Obviously, there is no way to predict the set of identifiers which will be
used in a given input program, so I would like to know if there have been
any studies which would suggest that any particular prime number (17?)
would (on average) yield good results for arbitrary sets of input strings
and arbitrary primary table sizes.

Anybody know?  Shall I do one?  Is there any theoretical basis for determining
the one "right" answer to this question (if there is one), or is an empirical
study the only way to come up with (at least) a good guess?

P.S.  Knuth uses 1009 in his examples, but what does he know! :-)

// rfg