Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Posting-Version: version B 2.10.2 9/18/84; site harvard.ARPA Path: utzoo!watmath!clyde!cbosgd!gatech!seismo!harvard!kosower From: kosower@harvard.ARPA (David A. Kosower) Newsgroups: net.internat Subject: Hyphenation (Long message) Message-ID: <501@harvard.ARPA> Date: Thu, 14-Nov-85 02:58:34 EST Article-I.D.: harvard.501 Posted: Thu Nov 14 02:58:34 1985 Date-Received: Fri, 15-Nov-85 05:04:04 EST References: <471@harvard.ARPA> <773@mmintl.UUCP> <968@enea.UUCP> Reply-To: kosower@harvard.ARPA Distribution: net Organization: Aiken Computation Laboratory, Harvard Lines: 238 In article <773@mmintl.UUCP>, Frank Adams claims that none of the existing hyphenation algorithms are any good. I believe this is false, and will attempt to demonstrate this below. But first, a few comments on article <968@enea.UUCP>, by Erland Sommarskog. He notes that there are words in Swedish whose hyphenation depends on context. This is a good point; in fact, such examples exist in English, too: consider the word "record". The verb is hyphenated "re-cord", the noun "rec-ord". This point is interesting not because it indicates any fundamental problem preventing the construction of an algorithmic hyphenator (it is in fact rather irrelevant, as we shall see), but rather because it is one of several reasons that dictionary lookup isn't a perfect solution to the hyphenation, either. Before proceeding, we must quickly dispose of the conclusion Sommarskog draws from this observation, namely that "the only sure way to get a correct hyphenation is to have it interactive" [sic]. Nonsense!! Note, for a trivial example, that on computers, the identity (a*2)/2 = a is not valid! Does this mean that integer arithmetic must be done interactively?? Of course not. It merely means the programmer must have some way of knowing when the difference between genuine integer arithmetic and integer arithmetic modulo the word-size of the computer is important (easy), and that the programmer should be able to get around this problem if he so desires (also easy). The parallel ingredients in our case are the ability to recognize that a word was hyphenated incorrectly (read the document you've written and consult a dictionary when in doubt), and the ability to override the computer's mistake (provided in any sensible text formatter). [Note that the latter is in any event necessary, since different authorities -- different dictionaries -- may disagree on the correct hyphenation of a word: compare "in-de-pend-ent" from the American Heritage Dictionary with "in-de-pen-dent" from Webster's. British speakers may reference that greatest dictionary of all, the Oxford English Dictionary, to see what it says about the word.] In any event, surely you do not want to be asked to hyphenate any numbers of words EVERY single time you run your document through a text formatter, or every time your what-you-see-is-what-you-get editor reads it in... good grief. Just to beat a dead horse a little bit more, and also because the point is important, what makes you so sure that YOU can hyphenate words correctly? Hyphenation, I submit, is considerably more difficult than spelling; given the rather large number of poor or mediocre spellers even among native speakers of a language, at least a unphonetic language like English in which spelling is difficult, I would be quite surprised if even a relatively poor hyphenation algorithm did not perform better than the vast majority of native speakers. [To see that this is NOT a trivial point, consider the parallel situation for many syntactical or semantic errors.] Most native speakers will probably hyphenate at least a fair percentage of words by... looking them up in a printed dictionary. To have a computer program, waiting for a human user to look up a fact in a printed dictionary and type it in, must surely be one of the better parodies of technological progress. Those who do not believe the claim made in the preceeding paragraph are invited to perform the following experiment: get a friend to pick 20 to 30 words out of a dictionary. Hyphenate them, and score yourself as a "hyphenation algorithm" by the criteria described below. Then repeat the experiment with 10 subject chosen and random from the local populace. So we really ought to have the computer do most of the hyphenation work. We want it to do it "well", so that we have to correct it as little as possible. What does "well" mean? Obviously, we don't just mean "yield as few incorrect hyphenations as possible", since the trivial algorithm "yield no hyphenations" satisfies this condition but surely is not a very useful hyphenation algorithm. In fact, there are three significant numbers about any hyphenation mechanism ("mechanism" here includes dictionary lookup): o The percentage of incorrect hyphenations it produces. o The percentage of all possible hyphenations that it actually finds. o Its efficiency. Both of the first two numbers should of course be measured for realistic text samples, i.e. they should weighted for REALISTIC frequencies of word appearances. We want the first number to be as close to zero as possible, and the second number to be as close to 100% as possible. But while we would probably not tolerate a percentage of incorrect hyphens greater than about 5% (remember that hyphenation isn't all that frequent in most documents, so this already amounts to a rather infrequent error), we might well tolerate an algorithm that produces signficantly less than 100% of all possible hyphens, especially if the hyphens it does find break the word up into small enough chunks; I would estimate that a figure as low as 70 to 80% might be acceptable here. Much of the remainder of this article is based upon Appendix H of the TeXbook, by Donald Knuth, and upon the Knuth-Liang hyphenation algorithm it describes. The two have invented an algorithm which satisfies the criteria described above, and its implementation in TeX, Knuth's formatter, allows both for user modifications, and for overriding the hyphenation decisions of the formatter in specific instances. In practice, I have indeed found that the algorithm works quite well. I will not describe the Liang algorithm in detail; those who want more information should consult either the aforementioned appendix of the TeXbook (by Donald E. Knuth, published jointly by the American Mathematical Society and Addison-Wesley, 1984) or Liang's Ph.D. thesis (Department of Computer Science, Stanford University, 1983). For our purposes, it will suffice to note that the algorithm first looks up the word in an exception dictionary, which is rather small (< 50 words), and for those words not in the expection dictionary (the majority) uses tables of word-fragment patterns to make a hyphenation decision. These tables occupy perhaps 20KB in compressed (but directly usable) form. Its vital statistics will also be useful (I have not verified these; they are quoted directly from Appendix H of the TeXbook): o It hyphenates completely and correctly the 700 or so most common words in English. o It finds 89.3% of the hyphens in a 115,000 word dictionary supplied by an unnamed publisher (I assume Merriam-Webster). o It inserts NO incorrect hyphens for any of the words in that dictionary. Let us now compare this algorithm and the other "obvious" method of hyphenation, dictionary lookup. Dictionaries are of course necessary for document preparation, e.g. for spelling checkers, and the preparation of computer-readable dictionaries for foreign languages is both necessary and desirable. I do not wish to belittle any such undertaking in any way; on the contrary, I encourage it strongly. Dictionaries could in principle also be used to store hyphenation points, and indeed, such dictionaries are needed as adjuncts to the preparation of tables for the Knuth-Liang algorithm, and for checking its accuracy. But I do not believe they provide the most efficient way of giving a text formatter information about hyphenation in a given language. I know this to be true for English, and I would expect it to be true for most Indo-European languages. For other languages (Semitic and Finno-Ugric are the two other major groups that come to mind; note that languages written in Kanji don't have hyphenation), it's anyone's guess. The example given at the very beginning, of a word whose hyphenation depends on its meaning, and therefore on the context it is present in (hah! there's another example: "pre-sent" and "pres-ent"), shows that a dictionary, too, cannot give all possible hyphenations. It must omit some. Dictionaries are rather large, and thus likely to contain errors; this will lead to a less-than-perfect score on avoiding incorrect hyphens. In fact, even a small dictionary will contain about 50K words; since the average word length in English is slightly over 4 characters, such a dictionary needs 200KB; the most space-efficient encoding for the hyphenation points demands another byte per word, so our baby dictionary is already 250KB. Even in this day and age, accessing a 20KB table, and performing a little computation is more efficient than accessing a 250KB table -- in fact it gets better as time goes along, simply because silicon hardware is getting faster more quickly than mass-storage devices. But these large dictionaries aren't large enough. English has a lot more words than 50K; so we really need a larger dictionary. How large? Well, we have to include derived forms (e.g. "demonstration" from "demonstrate") since the hyphens for derived forms CANNOT be derived easily from those for the root word (as I've said, hyphenation is harder than spelling!); compare "dem-on-stra-tion" with "de-mon-stra-tive". We also will need to store words built up by addition of suffixes and prefixes (how can we tell those that behave regularly from those that don't?). The Third Edition of Merriam-Webster's dictionary contains 450,000 words, admittedly including many compound nouns which probably don't need to be in a hyphenation dictionary, but also omitting many forms which ought to be included. Now we're talking 2.3 megabytes. I don't know how many words the full Oxford English Dictionary (surely the most comprehensive dictionary of any language ever assembled) includes, but I wouldn't be surprised if the figure were 2 or 3 million. So now we're up to 10 megabytes (perhaps more). Real estate on magnetic media is cheap, but not THAT cheap. And 10MB is not an amount of data that can be stored easily in RAM (and therefored accessed efficiently). Of course, one could stop at any stage, and either (a) give up, and demand that the user specify explicit hyphenations for everything that's left or (b) extend the dictionary with (horrors!) an algorithmic method of hyphenation. Either choice concedes that there's really something to algorithmic hyphenation after all. The second choice shows that the Knuth-Liang algorithm is one particular choice in a wide range of choices in a division of labor between a dictionary and an algorithm. But it is a careful choice, based on analysis, not on a haphazard gut feeling. It shows that a tiny dictionary in combination with a pattern-driven alogrithm can work, and work efficiently. In fact, the patterns that Liang has collected contain quite a bit of information about the structure of words in English not to be found by lookup in any dictionary. The last class of additions we must consider are coined words, some of which may derive from known words, and other which may not, and bits of common usage. Living languages, after all, are truly alive: they evolve and change, adding new words and new usages as time moves on. There is inevitably a time lag between the introduction of a new word, and its acceptance to the point it appears in a standardized dictionary. But new words do follow certain deep rules of a language; we may not understand these rules, or be able to enunciate them; but if we find an algorithm that captures the essence of these rules, we have found a real gem. Liang's algorithm is such a gem. As examples, I will consider coined words and nonsense words, the latter precisely because they sound as though they ought to be English (some such words, e.g. "chortle" = combination of "chuckle" and "snort", do eventually become full-fledged words). In particular, consider "yuppie" (coined), "quining" (from "quine", coined by Douglas Hofstadter in honor of the logician W. V. Quine), and "brillig" (nonsense word). They are not in any dictionary. But Liang's algorithm hyphenates them correctly [in my judgement]: "yup-pie", "quin-ing", and "bril-lig". As I said in my earlier posting, a great deal of knowledge has been accumulated over the course of many years' work on hyphenators for English. These efforts have culminated in a workable scheme. For related languages, I suspect the hardest part has already been done: finding an efficient, successful algorithm. One must construct the appropriate tables for the new target language, check the accuracy of the algorithm, and build up lists of exceptions which must be treated specially. Other general rules (the equivalent of "no discretionary hyphens may be placed in a hyphenated compound noun" and "do not hyphenate after the first character of a word" in English) must be formulated. For unrelated languages, the problem may as yet be unsolved. It might even be unsolvable, though I doubt that. Computers have evolved to the point where matters once beyond the "competence of computers" are amenable to algorithmic attack. I am specifically avoiding AI-ish approaches; these have failed more than they have succeeded, and they have given a bad name to any computation-based approach. For the hyphenation problem, this bad name is undeserved. The moral is that we should be looking for such algorithms for a variety of problems, not ranting and raving about how they can't work, and how primitive, knee-jerk solutions are "obviously" the only possible ones. David A. Kosower kosower@harvard.ARPA