Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site harvard.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!gatech!seismo!harvard!kosower
From: kosower@harvard.UUCP (David A. Kosower)
Newsgroups: net.internat
Subject: Re: Hyphenation (mainly interactive)
Message-ID: <529@harvard.UUCP>
Date: Sun, 24-Nov-85 17:16:02 EST
Article-I.D.: harvard.529
Posted: Sun Nov 24 17:16:02 1985
Date-Received: Tue, 26-Nov-85 08:26:30 EST
References: <501@harvard.ARPA> <471@harvard.ARPA> <773@mmintl.UUCP> <968@enea.UUCP> <1090@enea.UUCP>
Distribution: net
Organization: Aiken Comp Lab, Harvard
Lines: 82


   In reply to the discussion presented by Sommarskog (<1090@enea.UUCP>),
I have a few comments.   That Sommarskog's formatter must resort to
interactive hyphenation suggests to me that the algorithm he uses is
inadequate.  He has not indicated whether the problem is a high error
rate or an insufficient number of candidate hyphenation points.

   If the algorithm has an error rate (the fraction of the hyphens it
produces that are incorrect) that is too high, it is simply unacceptable.  No amount
of interactive fooling around is going to hide this from the user,
or make it more palatable.  What `too high' means here is application-
dependent;  I would tend to agree that my earlier figure of 5% as
an upper limit is too lenient, and that the upper bound ought to be
1 or 2%.
   If the algorithm has a success rate (the fraction of all possible
hyphenation points that it finds) that is too low, it simply isn't very useful.
In the case of Sommarskog's formatter, this means the user will have
to construct library files for many words.  It is *dumb*, *wasteful*,
and *error-prone* to have each user do this independently; such a 
library, in any sane system, will be a system-wide resource.  It will 
be done *once*, presumably by an outside company whose business is 
producing such files.  Hmmm... haven't we seen such a creature before?
Yes!  It's our old friend the hyphenation dictionary!
   If the algorithm has a high success rate, then there's really no need
for interactive hyphenation.  One could ask the user about exceptions
interactively, rather than requiring him to put them into his document
or into a library file, but there is no great advantage to doing so,
and the price is that the formatter can no longer run non-interactively,
even on valid input files.
   Interactive hyphenation seems to be mostly a crutch for inadequate
hyphenation algorithms.  The solution is not to add interactive
hyphenation to such a system, but to implement adequate hyphenation
algorithms.  Such algorithms are now *known* to exist, so there isn't
really an excuse not to use them.
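
   To make the two rates above concrete, here is a minimal sketch in
Python.  It assumes a hyphenation point is recorded as a (word, position)
pair, that the error rate is taken over the hyphens the algorithm
produced, and that the success rate is taken over the correct points;
the function name and the sample positions are made up for illustration.

```python
def hyphenation_rates(found, correct):
    """Return (error_rate, success_rate) for two collections of break points.

    found   -- points the algorithm produced, as (word, position) pairs
    correct -- points a reference dictionary allows, in the same form
    """
    found, correct = set(found), set(correct)
    wrong = found - correct   # produced, but not allowed
    hits = found & correct    # produced and allowed
    # Assumption: an algorithm that produces nothing makes no errors.
    error_rate = len(wrong) / len(found) if found else 0.0
    # Assumption: with nothing to find, the success rate is perfect.
    success_rate = len(hits) / len(correct) if correct else 1.0
    return error_rate, success_rate


# Hypothetical data: positions are illustrative, not from a real dictionary.
correct = {("market", 3), ("hyphenation", 2), ("hyphenation", 6), ("hyphenation", 7)}
found = {("market", 3), ("hyphenation", 2), ("hyphenation", 5)}
error_rate, success_rate = hyphenation_rates(found, correct)
```

   With this data one of the three hyphens produced is wrong, and two of
the four possible points are found.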

   Sommarskog further claims his method is language-independent.  This is
rubbish.  Hyphenation `rules' in, say, English are complicated, and
I doubt they will be embodied, explicitly or implicitly, in any algorithmic
approach that is not equivalent in complexity to the Knuth-Liang 
approach.  Porting the Knuth-Liang algorithm to other languages does
require a fair amount of work, as Sommarskog points out; but this
work only has to be done *once*, whereupon it is available to all who want
to write in that language, the residual incremental effort on each
user's part being negligible.
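
   To show where the language dependence sits in a pattern-based scheme
of the Knuth-Liang kind, here is a minimal sketch in Python.  The
patterns are invented for illustration and are not taken from TeX's
pattern file; the function names and the minimum fragment lengths (2
letters before a break, 3 after) are likewise assumptions of the sketch.
Only the pattern table would change from one language to the next.

```python
import re


def parse_pattern(pat):
    """Split a pattern such as 'r1k' into its letters 'rk' and levels [0, 1, 0]."""
    letters = re.sub(r"\d", "", pat)
    levels = [0] * (len(letters) + 1)
    pos = 0  # number of letters seen so far = index of the current gap
    for ch in pat:
        if ch.isdigit():
            levels[pos] = int(ch)
        else:
            pos += 1
    return letters, levels


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Insert '-' at every gap whose highest matching level is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i]: gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            levels = table.get(padded[start:end])
            if levels is not None:
                for k, level in enumerate(levels):
                    scores[start + k] = max(scores[start + k], level)
    pieces, last = [], 0
    # The gap before word[j] is the gap before padded[j + 1].
    for j in range(left_min, len(word) - right_min + 1):
        if scores[j + 1] % 2 == 1:
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return "-".join(pieces)


# Hypothetical toy patterns: 'a1r' and 'r1k' propose breaks, 'ma2r' vetoes one.
TOY_PATTERNS = ["r1k", "a1r", "ma2r"]
result = hyphenate("market", TOY_PATTERNS)
```

   The even level in `ma2r' outranks the odd level in `a1r', so the
spurious break `ma-rket' is suppressed and only `mar-ket' remains.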

   As far as the question about `tillaga' is concerned (the word
hyphenates `till-laga'), the answer is that although TeX will of course
not hyphenate it automatically in this fashion (it doesn't speak
Swedish either! :-)), it can be taught to do so.  Although the
following details are rather technical, I believe they are of
sufficient interest for me to present them.  Those who are familiar
with TeX and who have read the relevant section of the TeXbook
(by Donald E. Knuth, publ. by Addison-Wesley), please forgive my
verbosity.
   In TeX, discretionary hyphens are specified using the
`\discretionary' primitive.  It takes three arguments:  the pre-break
text, the post-break text, and the no-break text.  Thus, if one
wanted to specify the hyphenation of `market' explicitly (though TeX
already knows it), one would say:
       mar\discretionary{-}{}{}ket
because it hyphenates as `mar-ket'.  To hyphenate `tillaga', one could
say:
       till\discretionary{-}{l}{}aga
  or
       til\discretionary{l-}{}{}laga
One of these options is undoubtedly linguistically more correct, but
they produce the same effect.  The mechanism takes separate pre-break,
post-break, and no-break texts because it must also handle more
difficult examples: one Knuth gives is the German word `backen', which
hyphenates as `bak-ken'.  One must specify this by
       ba\discretionary{k-}{k}{ck}en
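
   The effect of the three arguments can be modelled in a few lines of
Python.  This is only a sketch of the behaviour described above, not of
TeX's internals; the names `Discretionary' and `render' are made up, and
a line break is represented simply by a newline character.

```python
from typing import NamedTuple


class Discretionary(NamedTuple):
    pre_break: str   # text ending the first line if the break is taken
    post_break: str  # text starting the next line if the break is taken
    no_break: str    # text used if the break is not taken


def render(items, take_break):
    """Join plain strings and discretionaries, taking every break or none."""
    out = []
    for item in items:
        if isinstance(item, Discretionary):
            if take_break:
                out.append(item.pre_break + "\n" + item.post_break)
            else:
                out.append(item.no_break)
        else:
            out.append(item)
    return "".join(out)


market = ["mar", Discretionary("-", "", ""), "ket"]
tillaga_a = ["till", Discretionary("-", "l", ""), "aga"]
tillaga_b = ["til", Discretionary("l-", "", ""), "laga"]
backen = ["ba", Discretionary("k-", "k", "ck"), "en"]
```

   Both `tillaga' variants give `tillaga' when unbroken and `till-' /
`laga' when broken, while `backen' needs all three texts: `ck' when
unbroken, `k-' and `k' around the break.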

  English has a slightly more subtle version of the same problem, which
has to do with ligatures; sometimes, a hyphen would split a ligature, e.g.
`...ff...' would become `...f-f...'.  Although this doesn't change
the actual characters, it does change the character widths involved,
and so the formatter must be cognizant of the fact (and TeX does 
know about such things).
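
   A small Python sketch of the width effect, for the curious.  The
glyph widths below are hypothetical numbers chosen for illustration, not
values from any real font, and `set_width' is a made-up name; the only
point is that an `ff' ligature need not be as wide as two separate f's.

```python
# Hypothetical glyph widths in points; not taken from any real font.
WIDTH = {"o": 5.0, "f": 3.0, "ff": 5.8, "e": 4.4, "r": 3.9, "-": 3.3}


def set_width(text, ligatures=("ff",)):
    """Total width of text, replacing each listed ligature by its own glyph."""
    total = 0.0
    i = 0
    while i < len(text):
        for lig in ligatures:
            if text.startswith(lig, i):
                total += WIDTH[lig]
                i += len(lig)
                break
        else:
            # No ligature starts here: set the single character.
            total += WIDTH[text[i]]
            i += 1
    return total


unbroken = set_width("offer")                   # uses the ff ligature
broken = set_width("of-") + set_width("fer")    # ligature split by the hyphen
```

   In the unbroken case the two f's occupy the ligature's width; once
the hyphen splits them, each f is set separately, so the material on
the two lines is measured differently even apart from the hyphen itself.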

                                    David A. Kosower
                                    kosower@harvard.ARPA