Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sun-barr!olivea!samsung!uakari.primate.wisc.edu!caen!spool.mu.edu!uunet!fjcp60!brinkema
From: brinkema@fjc.GOV (John R. Brinkema)
Newsgroups: comp.theory
Subject: What is the probability that I will be fired?
Summary: How safe is a CRC-32 in this case?
Keywords: database, CRC-32, signature
Message-ID: <438@fjcp60.GOV>
Date: 18 Jun 91 20:08:06 GMT
Distribution: all
Organization: Federal Judicial Center, Washington, D.C.
Lines: 37

I have developed a program to do a 'quick' copy of a large database (~1Gb)
from one machine to another.  The idea is to update an existing old copy of
the database on the second machine by copying only what has changed on the
first.  Changes are detected by comparing before-and-after CRC-32 (CCITT)
checksums on (currently) 4Kbyte chunks of the databases.  If the CRCs
differ, I copy and update; otherwise I assume the blocks are the same.

In the environment in which the program runs this is a reasonable idea: the
first database file is the master, the second is a slave of the first, and
we can tolerate some (small, I believe) probability that the second database
is different (heck, let's use the C word: corrupted).  In fact, in practice
no problems have been encountered, but we do a full copy of the master to
the slave periodically anyway.  This quick copy has saved us a tremendous
amount of time and our operations staff loves it.  On the other hand, we
*do* rely on the database, and I am sure that if I messed it up I would have
some very unhappy users (including my boss's boss).

In a fit of optimism that I now regret, I recently suggested that *backups*
on the first machine could be sped up by doing incremental backups with the
same algorithm.  That is, bet it all on the goodness of this algorithm at
identifying changed blocks, on the theory that the probability of it missing
one is less than or equal to the probability that media or operations or god
(who was recently given a pair of dice by a downstairs neighbor) does
something horrible and unrecoverable anyway.

The database changes are random ASCII and integer fields in packed records
placed randomly within a chunk.  I have no specific information on the
pattern of bit changes between CRC'ed chunks; I assume they are essentially
random rather than burst or periodic or something nicely mathematical.  I am
willing to tolerate false differences (other than all chunks -- that's not
fair).

So, finally, the questions: how bad is this algorithm?  Is there some other
checksum that will give better results -- that is, a better probability that
different chunks checksum differently?  What *is* the probability that
different chunks checksum the same (as a function of something I can
control, such as the size of the chunk)?  Or, in other words, what is the
probability that I will be fired?

jb.
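
P.S.  To make the question concrete, here is a rough sketch of the
chunk-compare scheme.  This is an illustration, not the production code:
Python and zlib's crc32 are stand-ins for whatever CRC-32 implementation is
actually in use, and the function names are invented.

    import zlib

    CHUNK_SIZE = 4096   # 4Kbyte chunks, as above

    def chunk_crcs(path):
        # One CRC-32 per CHUNK_SIZE block of the file.
        crcs = []
        with open(path, "rb") as f:
            while True:
                block = f.read(CHUNK_SIZE)
                if not block:
                    break
                crcs.append(zlib.crc32(block) & 0xffffffff)
        return crcs

    def changed_chunks(old_crcs, new_crcs):
        # Indices of chunks whose CRCs differ -- these are the ones copied
        # to the slave.  Chunks beyond the end of the shorter list are
        # copied unconditionally (the file grew or shrank).
        common = min(len(old_crcs), len(new_crcs))
        changed = [i for i in range(common) if old_crcs[i] != new_crcs[i]]
        changed.extend(range(common, len(new_crcs)))
        return changed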
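
P.P.S.  Here is the back-of-envelope arithmetic I am hoping someone will
confirm or demolish.  It assumes that a changed chunk keeps its old CRC with
probability about 2**-32, i.e. it treats CRC-32 on essentially random
multi-bit changes as a random 32-bit hash; the traffic numbers plugged in
below are invented for illustration.

    # Random-hash model: a changed chunk goes undetected with probability
    # about 2**-32.  In this model the chunk size matters only through
    # the *number* of changed chunks per run.
    P_MISS = 2.0 ** -32

    def p_any_missed(changed_per_run, runs):
        # Probability that at least one changed chunk is missed over
        # `runs` incremental copies.
        n = changed_per_run * runs
        return 1.0 - (1.0 - P_MISS) ** n

    # Invented example: 10000 changed 4Kbyte chunks per run, one run a
    # day for ten years:
    print(p_any_missed(10000, 3650))    # roughly 0.0085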