Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!seismo!lll-lcc!styx!twg-ap!amdahl!pyramid!prls!philabs!micomvax!musocs!mcgill-vision!mouse From: mouse@mcgill-vision.UUCP Newsgroups: comp.unix.wizards,comp.unix.questions Subject: Re: Spell and /usr/dict/web2 Message-ID: <723@mcgill-vision.UUCP> Date: Wed, 1-Apr-87 19:24:20 EST Article-I.D.: mcgill-v.723 Posted: Wed Apr 1 19:24:20 1987 Date-Received: Thu, 9-Apr-87 00:11:25 EST References: <135@utecfb.Toronto.Edu> <2563@hcr.UUCP> <12630@watnot.UUCP> Organization: McGill University, Montreal Lines: 40 Keywords: spell dictionary web2 Xref: utgpu comp.unix.wizards:1687 comp.unix.questions:1637 In article <12630@watnot.UUCP>, ccplumb@watnot.UUCP writes: > In article <2563@hcr.UUCP> mike@hcr.UUCP (Mike Tilson) writes: >> The algorithm works best when the final bit table has about 50% ones >> and 50% zeros. > Um... not quite. If the table is 50% ones, then 50% of any > collection of random character strings would be accepted as correct. > As I recall, from Jon Bentley's discussion of spell in Programming > Pearls, the hash table size was chosen so that, assuming 20 reported > problems per run, a misspelling would slip through about once every > hundred runs, so the ones should fill up 1/2000 of the hash table. Which spell are you talking about? Ours uses multiple bits per word; here's how it works....(don't ask me about the math, it's been too long since I took probability) /* * Hash table for spelling checker has n bits. * Each word w is hashed by k different (modular) hash functions, hi. * The bits hi(w), i=1..k, are set for words in the dictionary. * Assuming independence, the probability that no word of a d-word * dictionary sets a particular bit is given by the Poisson formula * P = exp(-y)*y**0/0!, where y=d*k/n. * The probability that a random string is recognized as a word is then * (1-P)**k. For given n and d this is minimum when y=log(2), P=1/2, * whence one finds, for example, that a 25000-word dictionary in a * 400000-bit table works best with k=11. */ 50% ones and 50% zeros maximizes the information content of the table. Assuming whoever wrote that comment knows their probability (and that their assumptions are correct), the chance of a random string being accepted is thus 1/2048. der Mouse Smart mailers: mouse@mcgill-vision.uucp USA: {ihnp4,decvax,akgua,utzoo,etc}!utcsri!musocs!mcgill-vision!mouse think!mosart!mcgill-vision!mouse ARPAnet: think!mosart!mcgill-vision!mouse@harvard.harvard.edu