Path: utzoo!utstat!jarvis.csri.toronto.edu!mailrus!wuarchive!usc!samsung!uunet!twwells!bill
From: bill@twwells.com (T. William Wells)
Newsgroups: news.software.b
Subject: Re: Not Checksums, how about string search hash codes?
Message-ID: <1989Nov22.105937.26531@twwells.com>
Date: 22 Nov 89 10:59:37 GMT
References: <49602@looking.on.ca> <1989Nov20.073448.18953@twwells.com> <50572@looking.on.ca>
Organization: None, Ft. Lauderdale, FL
Lines: 35

In article <50572@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
: In article <1989Nov20.073448.18953@twwells.com> bill@twwells.com (T. William Wells) writes:
: >This is easy enough. Just take any hashing function and use it to
: >set bits in a bit table for each word in the article. If a hashed
: >word creates a bit that is not in the table, the word is not in
: >the article.
:
: Is that all it is?  Seems that would require a lot of bits.  There were
: 151 unique words in your article, so you would need a 1500 bit table
: (188 bytes, or 226 bytes of printable characters) to get a 10%
: hit ratio using this algorithm.
:
: Of course you could have the hit ratio vary according to the article,
: but 226 seems like a lot.  I thought there were more compact algorithms.

Well, that is what having several hash codes ORed into the same
table gets you. If you have k independent (and perfectly
randomizing) hash codes, n bits in the table, and m words, the
probability of having all hash codes find a 1 in the table is (if
I did the math right)

	(1 - (1 - 1 / n) ** (m * k)) ** k

So, for one hash code, a random word has a 9.5% chance of being in
the table when it isn't in the article. But with two hash codes,
this becomes 3.3%. Alternately, one could shrink the table to
obtain the same failure rate as with one hash code.

This formula breaks down when the table has high occupancy,
because it ignores quantization. And hash codes are never
perfect. So experimentation is in order.

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com