Path: utzoo!attcan!uunet!samsung!zaphod.mps.ohio-state.edu!usc!jarthur!nntp-server.caltech.edu!arrester!glennc From: glennc@arrester.caltech.edu (Glenn C. Smith) Newsgroups: comp.sys.mac.apps Subject: Re: How do so many words fit in a dictionary/thesaurus? Message-ID: Date: 15 Sep 90 05:38:43 GMT References: <3374@stl.stc.co.uk> <29477@netnews.upenn.edu> <90257.141912SAS102@psuvm.psu.edu> Sender: news@laguna.ccsf.caltech.edu Organization: California Institute of Technology Lines: 42 Email bounced and its of general interest so... > How do so many words fit into a dictionary/thesaurus? I'd like to > use the compression technique for a medical reference system I > am developing... [the above quote is completely paraphrased from memory] I've always had the impression that they use a big binary tree. The letters in a word then reduce to the bits defining a path down to a marker signifying the existence of a word. Each bit corresponds to left or right as you descend the binary tree. This is in line with the use of such dictionaries. Given a word in a spelling dictionary, you follow the path it defines down the tree and see if you end at nothing (or not). If you end at nothing then the word is misspelled, if you end on something, then the word is correctly spelled. This is also a very fast search. The words are not exactly compressed, they are encoded in the structure of the tree. To list out all the words in such a tree you would use a recursive descent process. This method saves tons of space. All words that share the same pattern of letters (read left to right) share the same storage space until they become unique. The nature of the data (the words in the English language) make the space savings vast. I hope I was clear here. I don't think this method of "compression" lends itself to larger ordered blocks of text like references for a medical database. Notice that using a binary tree preserves no order of each individual word accept alphabetical. Good luck and I hope this helps! Glenn C. Smith -- _________________________________________________________________________ Glenn C. Smith | "It is a weak mind that can think California Institute of Technology | of only one way to spell a word." glennc@ccoroadd.caltech.edu | --- "Build high for happiness." ---