Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!cs.utexas.edu!usc!elroy.jpl.nasa.gov!ames!haven!adm!news
From: sysnmc@magic706.chron.com (Matt Cohen)
Newsgroups: comp.unix.questions
Subject: Re: spell
Message-ID: <23580@adm.BRL.MIL>
Date: 8 Jun 90 17:13:12 GMT
Sender: news@adm.BRL.MIL
Lines: 109


 > Specifically I would like to know:
 > 
 > 1. Why refuses spell to learn some words (either using +local_words, or using
 > spellin). E.g.: "rerouting".
 > 
 > 2. Why spell accepts some words as correct, though I see no way how they
 > could (mistakenly) be derived from correct words. E.g.: "neeeds", "miist".
 > (This holds for a SCO=Xenix distribution where I am absolutely sure, that
 > nobody has tempered with /usr/lib/spell/hlist[ab], *after* installation, that
 > is).
 > 
 > 3. Is there a way to see all the words currently in /usr/lib/spell/hlist[ab]?
 > Is that file in any way related to /usr/lib/spell/words?

Ulrich,

	To the best of my knowledge, the way spell works is that it 
hashes the dictionary into a very large hash table. Unfortunately, the 
table would be ridiculously large if the entire hash table were stored 
in the usual fashion. (Each hash value pointing to a linked list of words 
with that hash value.) Instead, spell uses a technique called 
"optimistic hashing".

	What is done is to drop the linked list at each hash value. You 
can then use a single bit to say "at least one word in the dictionary hashes
to this value". This decreases space usage dramatically. However,
although the hash table used is huge, there is a chance that any
random word will hash to the same value as a word in the dictionary.

	When one of these non-words is identified, it is entered into
the "stop list", which is then hashed with a different hash function.
The stop list is typically small since it's difficult to identify
these words except by chance.

	If a word hashes to a set bit in the hashed dictionary and
to a clear bit in the stop list, it is declared correct.

	Also, the word is examined to see if it could be the result
of applying one of a bunch of prefixes/suffixes to a root word. All
of these potential root words are also checked, and if any of them passes,
the word is declared correct.

	--- Now, this explains your problems.

 > 1. Why refuses spell to learn some words (either using +local_words, or using
 > spellin). E.g.: "rerouting".

	It looks like some quirk of spell keeps it from recognizing that

	rerouting = re + route - e + ing .

	It seems to figure out "reroute" and "routing" ok.

	My guess is that "rerouting" hashes to the same value as a word in the
stop list. Therefore, when you create a dictionary with "rerouting" in it,
it can't pass.

 > 2. Why spell accepts some words as correct, though I see no way how they
 > could (mistakenly) be derived from correct words. E.g.: "neeeds", "miist".
 > (This holds for a SCO=Xenix distribution where I am absolutely sure, that
 > nobody has tempered with /usr/lib/spell/hlist[ab], *after* installation, that
 > is).

	These words are "derived" from correct words.

	neeeds = nee + ed + s
	miist  = mi + ist

	The problem is that the dictionary contains no information
about which suffixes/prefixes are appropriate for each word. You
can therefore come up with many ridiculous things that spell says
are good words.

 > 3. Is there a way to see all the words currently in /usr/lib/spell/hlist[ab]?

	No. The hashed files contain none of the input words, see above.

 > Is that file in any way related to /usr/lib/spell/words?

	It is likely that a file similar to /usr/lib/spell/words
(/usr/dict/words on many Unixes) was hashed to produce /usr/lib/spell/hlista.
American spellings were probably naively changed to British and hashed
to produce hlistb.
	
	The source of hlista is probably not exactly /usr/dict/words, at
least on my version of Unix (SunOS 4.0.3). You can test this easily
with "spell /usr/lib/spell/words". By its very definition, no
word in the dictionary should be deemed incorrect.

	Whew! That was a mouthful. Hope it helped. You can gather
that modern spell checkers don't use this kind of optimistic hashing.
You may want to check out GNU's ispell program which solves many
of these problems. You can ftp it from uunet.uu.net. The files
are ~/gnu/{ispell.shar,ispell.el,dict.shar}. I can also send it to
you if you want.

	You may want to play with "spell -v" and "spell -x", if your 
system has these options.

	Please tell me what else you discover. Good luck!

			-- Matt

Matt Cohen                              INET: sysnmc@chron.com
Department of Technology Resources      UUCP: ...!uunet!chron!sysnmc
The Houston Chronicle                   AT&T: +1 713 220 7023
801 Texas Avenue, Houston, TX 77002     "The opinions above are most likely
"Houston's Leading Information Source"        those of my employer."