Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!cs.utexas.edu!usc!elroy.jpl.nasa.gov!ames!haven!adm!news
From: sysnmc@magic706.chron.com (Matt Cohen)
Newsgroups: comp.unix.questions
Subject: Re: spell
Message-ID: <23580@adm.BRL.MIL>
Date: 8 Jun 90 17:13:12 GMT
Sender: news@adm.BRL.MIL
Lines: 109

> Specifically I would like to know:
>
> 1. Why refuses spell to learn some words (either using +local_words, or using
> spellin). E.g.: "rerouting".
>
> 2. Why spell accepts some words as correct, though I see no way how they
> could (mistakenly) be derived from correct words. E.g.: "neeeds", "miist".
> (This holds for a SCO=Xenix distribution where I am absolutely sure, that
> nobody has tempered with /usr/lib/spell/hlist[ab], *after* installation, that
> is).
>
> 3. Is there a way to see all the words currently in /usr/lib/spell/hlist[ab]?
> Is that file in any way related to /usr/lib/spell/words?

Ulrich,

To the best of my knowledge, the way spell works is that it hashes the
dictionary into a very large hash table.

Unfortunately, the table would be ridiculously large if the entire hash
table were stored in the usual fashion.  (Each hash value pointing to a
linked list of words with that hash value.)

Instead, spell uses a technique called "optimistic hashing".  What is
done is to drop the linked list at each hash value.  You can then use a
single bit to say "at least one word in the dictionary hashes to this
value".  This decreases space usage dramatically.

However, although the hash table used is huge, there is a chance that
any random word will hash to the same value as a word in the dictionary.
When one of these non-words is identified, it is entered into the "stop
list", which is then hashed with a different hash function.  The stop
list is typically small since it's difficult to identify these words
except by chance.

If a word hashes to a set bit in the hashed dictionary and to a clear
bit in the stop list, it is declared correct.
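
Here is a rough sketch of the scheme described above, in Python for
brevity.  It is only an illustration of the idea, not spell's actual
code: the table size, the use of SHA-256 with a salt as the two
"different hash functions", and the names BitTable and is_accepted are
all my own assumptions.

```python
import hashlib

# Assumed table size; the post does not say how large spell's table is.
TABLE_BITS = 1 << 20


def bit_index(word, salt):
    """Map a word to a bit position. The salt stands in for 'a different
    hash function'; SHA-256 is an assumption, not what spell uses."""
    digest = hashlib.sha256(salt + word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % TABLE_BITS


class BitTable:
    """One bit per hash value: 'at least one word hashes to this value'.
    No word is stored, so the word list cannot be recovered from it."""

    def __init__(self, words, salt):
        self.salt = salt
        self.bits = bytearray(TABLE_BITS // 8)
        for word in words:
            i = bit_index(word, salt)
            self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, word):
        i = bit_index(word, self.salt)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))


def is_accepted(word, dictionary, stop_list):
    """Set bit in the hashed dictionary and clear bit in the stop list."""
    return word in dictionary and word not in stop_list
```

With a dictionary table built from a word list and a stop table built
from the known bad words, is_accepted("route", dictionary, stop_list)
is true whenever "route" was in the word list and does not collide with
a stop-list entry.  Any word that hits a set bit in the stop table is
rejected, even if it was added to the dictionary.
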

Also, the word is examined to see if it could be the result of applying
one of a bunch of prefixes/suffixes to a root word.  All of these
potential root words are also checked, and if any of them passes, the
word is declared correct.

---

Now, this explains your problems.

> 1. Why refuses spell to learn some words (either using +local_words, or using
> spellin). E.g.: "rerouting".

It looks like some quirk of spell keeps it from recognizing that

	rerouting = re + route - e + ing.

It seems to figure out "reroute" and "routing" ok.

My guess is that "rerouting" hashes to the same value as a word in the
stop list.  Therefore, when you create a dictionary with "rerouting" in
it, it can't pass.

> 2. Why spell accepts some words as correct, though I see no way how they
> could (mistakenly) be derived from correct words. E.g.: "neeeds", "miist".
> (This holds for a SCO=Xenix distribution where I am absolutely sure, that
> nobody has tempered with /usr/lib/spell/hlist[ab], *after* installation, that
> is).

These words are "derived" from correct words.

	neeeds = nee + ed + s
	miist  = mi + ist

The problem is that the dictionary contains no information about which
suffixes/prefixes are appropriate for each word.  You can therefore come
up with many ridiculous things that spell says are good words.

> 3. Is there a way to see all the words currently in /usr/lib/spell/hlist[ab]?

No.  The hashed files contain none of the input words, see above.

> Is that file in any way related to /usr/lib/spell/words?

It is likely that a file similar to /usr/lib/spell/words (/usr/dict/words
on many Unixes) was hashed to produce /usr/lib/spell/hlista.  American
spellings were probably naively changed to British and hashed to produce
hlistb.

The source of hlista is probably not exactly /usr/dict/words, at least
on my version of Unix (SunOS 4.0.3).  You can test this easily with
"spell /usr/lib/spell/words".  By its very definition, no word in the
dictionary should be deemed incorrect.

Whew!
That was a mouthful.  Hope it helped.

You can gather that modern spell checkers don't use this kind of
optimistic hashing.  You may want to check out GNU's ispell program
which solves many of these problems.  You can ftp it from uunet.uu.net.
The files are ~/gnu/{ispell.shar,ispell.el,dict.shar}.  I can also send
it to you if you want.

You may want to play with "spell -v" and "spell -x", if your system has
these options.  Please tell me what else you discover.

Good luck!

		-- Matt

Matt Cohen                              INET: sysnmc@chron.com
Department of Technology Resources      UUCP: ...!uunet!chron!sysnmc
The Houston Chronicle                   AT&T: +1 713 220 7023
801 Texas Avenue, Houston, TX 77002     "The opinions above are most likely
"Houston's Leading Information Source"    those of my employer."