Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!decwrl!frnkmth!bill
From: bill@franklin.com (bill)
Newsgroups: comp.compression
Subject: Re: fast string compr.?
Message-ID: <24May91.206106.345@franklin.com>
Date: 24 May 91 20:51:06 GMT
References: <LACA.91May20141838@vipunen.hut.fi> <1991May23.235530.11443@looking.on.ca>
Distribution: comp
Organization: Franklin Electronic Publishers
Lines: 44

In article <1991May23.235530.11443@looking.on.ca>
	brad@looking.on.ca (Brad Templeton) writes:
: Data compression is not normally used for spell checkers.

Hokum. The data that goes into, for example, one of our Italian
spelling checkers is around 13M of text. It is compressed into
around 64K, reversibly. I think that can be called data
compression.

: Instead, most spell checkers use hashing.

No, Brad, they do not. As one of the developers of the most
widely used OEM spelling checker, and having had to take a good
look at our competitors at various times, I can say this with
authority.

Furthermore, hash techniques were generally discarded over nine
years ago because they accept way too many spelling errors. And,
no, you cannot make them acceptable by hard coding the common
spelling mistakes they fail on. No matter how low the probability
of false acceptance is made, it is still high enough to make the
customers wince.

*Please*, people, do not use this algorithm for a spelling
checker. It has its uses, but this is not one of them.

There was an article in a recent DDJ, if one is interested in
implementation details of the hash algorithm that Brad described.
I don't recall the exact title, but it had the word "existence" in
it. (That's an example of brain damage if ever I saw one; this
technique is useful for testing for the *nonexistence* of
something, or for screening for the *possible* existence of
something.)

BTW, the *correct* answer to the question of the "normal" way of
storing the lexicon is this: store the lexicon as a trie. This is
easily the best method for heavily inflected languages like
Finnish, and is used by most of the big players. The other method
of doing it is to store the dictionary in a sorted order possibly
with an in-memory index and/or a small cache of frequently used words
to avoid most disk lookups. This, too, typically involves a
rudimentary kind of data compression (dictionary style), in that
groups of inflected words are often stored as a root and
inflection information.