Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site harvard.ARPA
Path: utzoo!watmath!clyde!cbosgd!gatech!seismo!harvard!kosower
From: kosower@harvard.ARPA (David A. Kosower)
Newsgroups: net.internat
Subject: Hyphenation (Long message)
Message-ID: <501@harvard.ARPA>
Date: Thu, 14-Nov-85 02:58:34 EST
Article-I.D.: harvard.501
Posted: Thu Nov 14 02:58:34 1985
Date-Received: Fri, 15-Nov-85 05:04:04 EST
References: <471@harvard.ARPA> <773@mmintl.UUCP> <968@enea.UUCP>
Reply-To: kosower@harvard.ARPA
Distribution: net
Organization: Aiken Computation Laboratory, Harvard
Lines: 238


   In article <773@mmintl.UUCP>, Frank Adams claims that none of the
existing hyphenation algorithms are any good.  I believe this is
false, and will attempt to demonstrate this below.  But first,
a few comments on article <968@enea.UUCP>, by Erland Sommarskog.
He notes that there are words in Swedish whose hyphenation depends
on context.  This is a good point; in fact, such examples exist
in English, too: consider the word "record".  The verb is hyphenated
"re-cord", the noun "rec-ord".  This point is interesting not
because it indicates any fundamental problem preventing the
construction of an algorithmic hyphenator (it is in fact rather
irrelevant, as we shall see), but rather because it is one of
several reasons that dictionary lookup isn't a perfect solution
to the hyphenation, either.

   Before proceeding, we must quickly dispose of the conclusion
Sommarskog draws from this observation, namely that "the only sure
way to get a correct hyphenation is to have it interactive" [sic].
Nonsense!!  Note, for a trivial example, that on computers, the
identity (a*2)/2 = a is not valid!  Does this mean that integer
arithmetic must be done interactively??  Of course not.  It merely
means the programmer must have some way of knowing when the difference
between genuine integer arithmetic and integer arithmetic modulo
the word-size of the computer is important (easy), and that the
programmer should be able to get around this problem if he so
desires (also easy).  The parallel ingredients in our case are
the ability to recognize that a word was hyphenated incorrectly
(read the document you've written and consult a dictionary when
in doubt), and the ability to override the computer's mistake
(provided in any sensible text formatter).
[Note that the latter is in any event necessary, since different
authorities -- different dictionaries -- may disagree on the
correct hyphenation of a word: compare "in-de-pend-ent" from
the American Heritage Dictionary with "in-de-pen-dent" from 
Webster's.  British speakers may reference that greatest 
dictionary of all, the Oxford English Dictionary, to see what
it says about the word.]
In any event, surely you do not want to be asked to hyphenate any
numbers of words EVERY single time you run your document through
a text formatter, or every time your what-you-see-is-what-you-get
editor reads it in... good grief.

   Just to beat a dead horse a little bit more, and also because
the point is important, what makes you so sure that YOU can hyphenate
words correctly?  Hyphenation, I submit, is considerably more 
difficult than spelling; given the rather large number of poor
or mediocre spellers even among native speakers of a language,
at least a unphonetic language like English in which spelling is 
difficult, I would be quite surprised if even a relatively poor
hyphenation algorithm did not perform better than the vast
majority of native speakers.  [To see that this is NOT a trivial
point, consider the parallel situation for many syntactical or
semantic errors.]  Most native speakers will probably hyphenate
at least a fair percentage of words by... looking them up in
a printed dictionary.  To have a computer program, waiting for
a human user to look up a fact in a printed dictionary and type
it in, must surely be one of the better parodies of technological
progress.

   Those who do not believe the claim made in the preceeding 
paragraph are invited to perform the following experiment: get
a friend to pick 20 to 30 words out of a dictionary.  Hyphenate
them, and score yourself as a "hyphenation algorithm" by the
criteria described below.  Then repeat the experiment with 10
subject chosen and random from the local populace.

   So we really ought to have the computer do most of the hyphenation
work.  We want it to do it "well", so that we have to correct it as
little as possible.  What does "well" mean?  Obviously, we don't
just mean "yield as few incorrect hyphenations as possible", since
the trivial algorithm "yield no hyphenations" satisfies this
condition but surely is not a very useful hyphenation algorithm.
In fact, there are three significant numbers about any hyphenation
mechanism ("mechanism" here includes dictionary lookup):

   o  The percentage of incorrect hyphenations it produces.

   o  The percentage of all possible hyphenations that it actually
      finds.

   o  Its efficiency.

Both of the first two numbers should of course be measured for realistic
text samples, i.e. they should weighted for REALISTIC frequencies
of word appearances.  We want the first number to be as close to
zero as possible, and the second number to be as close to 100%
as possible.  But while we would probably not tolerate a percentage
of incorrect hyphens greater than about 5% (remember that hyphenation
isn't all that frequent in most documents, so this already amounts
to a rather infrequent error), we might well tolerate an algorithm
that produces signficantly less than 100% of all possible hyphens,
especially if the hyphens it does find break the word up into
small enough chunks; I would estimate that a figure as low as 70 to
80% might be acceptable here.

   Much of the remainder of this article is based upon Appendix H
of the TeXbook, by Donald Knuth, and upon the Knuth-Liang hyphenation
algorithm it describes.  The two have invented an algorithm which
satisfies the criteria described above, and its implementation in
TeX, Knuth's formatter, allows both for user modifications, and
for overriding the hyphenation decisions of the formatter in 
specific instances.  In practice, I have indeed found that the
algorithm works quite well.

  I will not describe the Liang algorithm in detail; those who
want more information should consult either the aforementioned
appendix of the TeXbook (by Donald E. Knuth, published jointly
by the American Mathematical Society and Addison-Wesley, 1984)
or Liang's Ph.D. thesis (Department of Computer Science,
Stanford University, 1983).  For our purposes, it will suffice to
note that the algorithm first looks up the word in an exception
dictionary, which is rather small (< 50 words), and for those
words not in the expection dictionary (the majority) uses tables
of word-fragment patterns to make a hyphenation decision.  These
tables occupy perhaps 20KB in compressed (but directly usable)
form.  Its vital statistics will also be useful (I have not
verified these; they are quoted directly from Appendix H
of the TeXbook):

  o   It hyphenates completely and correctly the 700 or so most
      common words in English.
  o   It finds 89.3% of the hyphens in a 115,000 word dictionary
      supplied by an unnamed publisher (I assume Merriam-Webster).
  o   It inserts NO incorrect hyphens for any of the words in
      that dictionary.

Let us now compare this algorithm and the other "obvious" method
of hyphenation, dictionary lookup.

   Dictionaries are of course necessary for document preparation,
e.g. for spelling checkers, and the preparation of computer-readable
dictionaries for foreign languages is both necessary and 
desirable.  I do not wish to belittle any such undertaking in
any way; on the contrary, I encourage it strongly.  Dictionaries
could in principle also be used to store hyphenation points,
and indeed, such dictionaries are needed as adjuncts to the preparation
of tables for the Knuth-Liang algorithm, and for checking its
accuracy.  But I do not believe they provide the most efficient
way of giving a text formatter information about hyphenation in
a given language.  I know this to be true for English, and I would
expect it to be true for most Indo-European languages.  For
other languages (Semitic and Finno-Ugric are the two other major 
groups that come to mind; note that languages written in Kanji
don't have hyphenation), it's anyone's guess.

   The example given at the very beginning, of a word whose hyphenation
depends on its meaning, and therefore on the context it is present
in (hah! there's another example: "pre-sent" and "pres-ent"), shows
that a dictionary, too, cannot give all possible hyphenations.
It must omit some.  Dictionaries are rather large, and thus likely
to contain errors; this will lead to a less-than-perfect score on
avoiding incorrect hyphens.  In fact, even a small dictionary will
contain about 50K words; since the average word length in English
is slightly over 4 characters, such a dictionary needs 200KB;
the most space-efficient encoding for the hyphenation points demands
another byte per word, so our baby dictionary is already 250KB.
Even in this day and age, accessing a 20KB table, and performing
a little computation is more efficient than accessing a 250KB
table -- in fact it gets better as time goes along, simply because
silicon hardware is getting faster more quickly than mass-storage devices.
But these large dictionaries aren't large enough.  English has a lot 
more words than 50K; so we really need a larger dictionary.  How
large?   Well, we have to include derived forms (e.g. "demonstration"
from  "demonstrate") since the hyphens for derived forms CANNOT be
derived easily from those for the root word (as I've said, hyphenation
is harder than spelling!); compare "dem-on-stra-tion" with
"de-mon-stra-tive".  We also will need to store words built up by
addition of suffixes and prefixes (how can we tell those that behave
regularly from those that don't?).  The Third Edition of
Merriam-Webster's dictionary contains 450,000 words, admittedly
including many compound nouns which probably don't need to be in a
hyphenation dictionary, but also omitting many forms which ought to be
included.  Now we're talking 2.3 megabytes.  I don't know how many
words the full Oxford English Dictionary (surely the most
comprehensive dictionary of any language ever assembled) includes, but
I wouldn't be surprised if the figure were 2 or 3 million.   So now
we're up to 10 megabytes (perhaps more).  Real estate on magnetic
media is cheap, but not THAT cheap.  And 10MB is not an amount of data
that can be stored easily in RAM (and therefored accessed
efficiently).

  Of course, one could stop at any stage, and either (a) give up, and
demand that the user specify explicit hyphenations for everything
that's left or (b) extend the dictionary with (horrors!) an
algorithmic method of hyphenation.  Either choice concedes that
there's really something to algorithmic hyphenation after all.  The
second choice shows that the Knuth-Liang algorithm is one particular
choice in a wide range of choices in a division of labor between a
dictionary and an algorithm.  But it is a careful choice, based on
analysis, not on a haphazard gut feeling.  It shows that a tiny
dictionary in combination with a pattern-driven alogrithm can work,
and work efficiently.

   In fact, the patterns that Liang has collected contain quite a bit
of information about the structure of words in English not to be found
by lookup in any dictionary.  The last class of additions we must
consider are coined words, some of which may derive from known words,
and other which may not, and bits of common usage.  Living languages,
after all, are truly alive: they evolve and change, adding new words
and new usages as time moves on.  There is inevitably a time lag
between the introduction of a new word, and its acceptance to the
point it appears in a standardized dictionary.  But new words do
follow certain deep rules of a language; we may not understand these
rules, or be able to enunciate them; but if we find an algorithm that
captures the essence of these rules, we have found a real gem.
Liang's algorithm is such a gem.  As examples, I will consider coined
words and nonsense words, the latter precisely because they sound as
though they ought to be English (some such words, e.g.  "chortle" =
combination of "chuckle" and "snort", do eventually become
full-fledged words).  In particular, consider "yuppie" (coined),
"quining" (from "quine", coined by Douglas Hofstadter in honor of the
logician W. V. Quine), and "brillig" (nonsense word).  They  are not
in any dictionary.  But Liang's algorithm hyphenates them correctly
[in my judgement]: "yup-pie", "quin-ing", and "bril-lig".

   As I said in my earlier posting, a great deal of knowledge has been
accumulated over the course of many years' work on hyphenators for English.  
These efforts have culminated in a workable scheme.  For related languages, 
I suspect the hardest part has already been done: finding an efficient, 
successful algorithm.  One must construct the appropriate tables for the new
target language, check the accuracy of the algorithm, and build up
lists of exceptions which must be treated specially.  Other general
rules (the equivalent of "no discretionary hyphens may be placed in a
hyphenated compound noun" and "do not hyphenate after the first
character of a word" in English) must be formulated.  For unrelated
languages, the problem may as yet be unsolved.  It might even be
unsolvable, though I doubt that.  Computers have evolved to the point
where matters once beyond the "competence of computers" are amenable
to algorithmic attack.  I am specifically avoiding AI-ish approaches;
these have failed more than they have succeeded, and they have given a
bad name to any computation-based approach.  For the hyphenation
problem, this bad name is undeserved.  The moral is that we should
be looking for such algorithms for a variety of problems, not ranting and
raving about how they can't work, and how primitive, knee-jerk solutions
are "obviously" the only possible ones.

                                       David A. Kosower
                                       kosower@harvard.ARPA