Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!pacbell.com!pacbell!osc!jgk
From: jgk@osc.COM (Joe Keane)
Newsgroups: comp.theory
Subject: Re: What is the probability that I will be fired?
Summary: Too much.
Keywords: database, CRC-32, signature
Message-ID: <4908@osc.COM>
Date: 21 Jun 91 19:00:11 GMT
References: <438@fjcp60.GOV>
Reply-To: Joe Keane <jgk@osc.com>
Organization: Versant Object Technology, Menlo Park, CA
Lines: 36
X-Moon-Phase: Waxing Gibbous (76% of Full)

My rule of thumb is that you can run out of 32 bits, but you won't run out of
64 bits.  In other words, many situations can have 2^32 things, but very few
have 2^64 things.  For example, a fairly large disk contains 2^32 bits, and
you can execute 2^32 instructions in a few minutes.  So if you have an event
which occurs with probability 2^-32, it will probably happen eventually, while
2^-64 is likely to never happen.

Note that you have to double the number of bits for `birthday' problems, where
any matched pair is the relevant event.  For example, suppose you're making
Ethernet addresses at random, and you don't want to have any two the same.
Using 64 bits is risky, because when you have a billion machines, there's a
good chance that there will be two with the same address.  On the other hand,
with 128 bits i'd be confident that this will never happen.  Of course, this
assumes that the addresses are really random, and there's no hidden pattern.

For the application described, i think 32 bits is not enough.  It's likely
that eventually some block will mistakenly not be backed up.  I'd say 64 bits
is pretty safe.  However, since the application does a lot of disk work
anyway, the hashing probably isn't a critical part.  So i'd just use 128 bits.

Also, i wouldn't particularly trust CRC for this application.  There are some
patterns of changes that `fall through'.  It's great for dealing with random
bit errors, since it's extremely unlikely for them to cause this to happen.
But i wouldn't use it for comparing structured data, since then there may be
some data which happens to hit this pattern.  Then the probability of making a
mistake becomes much higher.

Making good hashing functions is something you could write a book about.  It's
closely related to making good encryption functions.  I'll give a quick
summary.  First, don't make it affine, like CRC.  The bits should get mixed up
well.  Second, make sure you don't lose information.  In other words, your
state transition function should be bijective as a function of the old state.
Given those two, i think you can do what you want and have it come out OK.
--
Joe Keane, amateur mathematician
jgk@osc.com (...!uunet!stratus!osc!jgk)