Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!pacbell.com!pacbell!osc!jgk From: jgk@osc.COM (Joe Keane) Newsgroups: comp.theory Subject: Re: What is the probability that I will be fired? Summary: Too much. Keywords: database, CRC-32, signature Message-ID: <4908@osc.COM> Date: 21 Jun 91 19:00:11 GMT References: <438@fjcp60.GOV> Reply-To: Joe Keane Organization: Versant Object Technology, Menlo Park, CA Lines: 36 X-Moon-Phase: Waxing Gibbous (76% of Full) My rule of thumb is that you can run out of 32 bits, but you won't run out of 64 bits. In other words, many situations can have 2^32 things, but very few have 2^64 things. For example, a fairly large disk contains 2^32 bits, and you can execute 2^32 instructions in a few minutes. So if you have an event which occurs with probability 2^-32, it will probably happen eventually, while 2^-64 is likely to never happen. Note that you have to double the number of bits for `birthday' problems, where any matched pair is the relevant event. For example, suppose you're making Ethernet addresses at random, and you don't want to have any two the same. Using 64 bits is risky, because when you have a billion machines, there's a good chance that there will be two with the same address. On the other hand, with 128 bits i'd be confident that this will never happen. Of course, this assumes that the addresses are really random, and there's no hidden pattern. For the application described, i think 32 bits is not enough. It's likely that eventually some block will mistakenly not be backed up. I'd say 64 bits is pretty safe. However, since the application does a lot of disk work anyway, the hashing probably isn't a critical part. So i'd just use 128 bits. Also, i wouldn't particularly trust CRC for this application. There are some patterns of changes that `fall through'. It's great for dealing with random bit errors, since it's extremely unlikely for them to cause this to happen. But i wouldn't use it for comparing structured data, since then there may be some data which happens to hit this pattern. Then the probability of making a mistake becomes much higher. Making good hashing functions is something you could write a book about. It's closely related to making good encryption functions. I'll give a quick summary. First, don't make it affine, like CRC. The bits should get mixed up well. Second, make sure you don't lose information. In other words, your state transition function should be bijective as a function of the old state. Given those two, i think you can do what you want and have it come out OK. -- Joe Keane, amateur mathematician jgk@osc.com (...!uunet!stratus!osc!jgk)