Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site harvard.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!gatech!seismo!harvard!kosower
From: kosower@harvard.UUCP (David A. Kosower)
Newsgroups: net.internat
Subject: Re: Hyphenation (mainly interactive)
Message-ID: <529@harvard.UUCP>
Date: Sun, 24-Nov-85 17:16:02 EST
Article-I.D.: harvard.529
Posted: Sun Nov 24 17:16:02 1985
Date-Received: Tue, 26-Nov-85 08:26:30 EST
References: <501@harvard.ARPA> <471@harvard.ARPA> <773@mmintl.UUCP> <968@enea.UUCP> <1090@enea.UUCP>
Distribution: net
Organization: Aiken Comp Lab, Harvard
Lines: 82

In reply to the discussion presented by Sommarskog (<1090@enea.UUCP>),
I have a few comments.

The need for Sommarskog's formatter to resort to interactive
hyphenation suggests to me that the algorithm he uses is somewhat
inadequate. Whether this is a result of its error rate or of an
insufficient number of candidate hyphenation points found, he has not
indicated.

If the algorithm has an error rate (number of incorrect hyphens
produced) that is too high, it is simply unacceptable. No amount of
interactive fooling around is going to hide this from the user, or
make it more palatable. What `too high' means here is
application-dependent; I would tend to agree that my earlier figure of
5% as an upper limit is too lenient, and that the upper bound ought to
be 1 or 2%.

If the algorithm has a success rate (number of hyphens found, out of
all those possible) that is too low, it simply isn't very useful. In
the case of Sommarskog's formatter, this means the user will have to
construct library files for many words. It is *dumb*, *wasteful*, and
*error-prone* to have each user do this independently; such a library,
in any sane system, will be a system-wide resource. It will be done
*once*, presumably by an outside company whose business is producing
such files. Hmmm... haven't we seen such a creature before? Yes!
It's our old friend the hyphenation dictionary!

If the algorithm has a high success rate, then there's really no need
for interactive hyphenation. One might ask the user about exceptions
here, rather than requiring him to put them into his document or into
a library file, though only at the price of disabling non-interactive
running of the formatter even for valid input files; but there is no
great advantage to doing so.

Interactive hyphenation seems to be mostly a crutch for inadequate
hyphenation algorithms. The solution is not to add interactive
hyphenation to such a system, but to implement adequate hyphenation
algorithms. Such algorithms are now *known* to exist, so there isn't
really an excuse not to use them.

Sommarskog further claims his method is language-independent. This is
rubbish. Hyphenation `rules' in, say, English are complicated, and I
doubt they will be embodied, explicitly or implicitly, in any
algorithmic approach that is not equivalent in complexity to the
Knuth-Liang approach. Porting the Knuth-Liang algorithm to other
languages does require a fair amount of work, as Sommarskog points
out; but this work only has to be done *once*, whereupon it is
available to all who want to write in that language, the residual
incremental effort on each user's part being negligible.

As far as the question about `tillaga' is concerned (the word
hyphenates `till-laga'), the answer is that although TeX will of
course not hyphenate it automatically in this fashion (it doesn't
speak Swedish either! :-)), it can be taught to do so.

Although the following details are rather technical, I believe they
are of sufficient interest for me to present them. Those who are
familiar with TeX and who have read the relevant section of the
TeXbook (by Donald E. Knuth, publ. by Addison-Wesley), please forgive
my verbosity.

In TeX, discretionary hyphens are specified using the `\discretionary'
primitive.
It takes three arguments: the pre-break text, the post-break text, and
the no-break text. Thus, if one wanted to specify the hyphenation of
`market' explicitly (though TeX already knows it), one would say:

    mar\discretionary{-}{}{}ket

because it hyphenates as `mar-ket'. To hyphenate `tillaga', one could
say:

    till\discretionary{-}{l}{}aga

or

    til\discretionary{l-}{}{}laga

One of these options is undoubtedly linguistically more correct, but
they produce the same effect. The reason for the two options is that
the mechanism can handle more difficult examples: one Knuth gives is
the German word `backen', which hyphenates as `bak-ken'. One must
specify this by

    ba\discretionary{k-}{k}{ck}en

English has a slightly more subtle version of the same problem, which
has to do with ligatures; sometimes, a hyphen would split a ligature,
e.g. `...ff...' would become `...f-f...'. Although this doesn't change
the actual characters, it does change the character widths involved,
and so the formatter must be cognizant of the fact (and TeX does know
about such things).

David A. Kosower
kosower@harvard.ARPA
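
The three-argument behaviour of `\discretionary' described above can
be modelled in a few lines of Python. This is only a sketch, not TeX's
actual implementation: the function name `typeset' and the `break_at'
parameter are hypothetical, and the decision whether to break (which
TeX makes from widths and penalties) is simply passed in by the caller.

```python
from typing import List, Optional, Tuple, Union

# A discretionary is modelled as (pre-break, post-break, no-break) text,
# mirroring the three arguments of \discretionary.
Disc = Tuple[str, str, str]
Item = Union[str, Disc]


def typeset(items: List[Item], break_at: Optional[int] = None) -> List[str]:
    """Return the output lines for a word made of text and discretionaries.

    break_at is the index (in items) of the discretionary at which a line
    break is taken, or None if the word is set unbroken. (Hypothetical
    interface; a real formatter chooses the break itself.)
    """
    lines = [""]
    for i, item in enumerate(items):
        if isinstance(item, str):
            lines[-1] += item            # ordinary text
        else:
            pre, post, nobreak = item
            if i == break_at:
                lines[-1] += pre         # ends the current line
                lines.append(post)       # starts the next line
            else:
                lines[-1] += nobreak     # no break taken here
    return lines


# The examples from the text above.
market = ["mar", ("-", "", ""), "ket"]
tillaga_a = ["till", ("-", "l", ""), "aga"]
tillaga_b = ["til", ("l-", "", ""), "laga"]
backen = ["ba", ("k-", "k", "ck"), "en"]

for word in (market, tillaga_a, tillaga_b, backen):
    print(typeset(word), typeset(word, break_at=1))
```

Both `tillaga' variants give the same lines whether or not the break is
taken, while `backen' shows why all three arguments are needed: the
unbroken form contains `ck', which appears in neither half of the
broken form.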