Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site mmintl.UUCP
Path: utzoo!watmath!clyde!bonnie!akgua!gatech!seismo!cmcl2!philabs!pwa-b!mmintl!franka
From: franka@mmintl.UUCP (Frank Adams)
Newsgroups: net.internat
Subject: Re: Hyphenation (Long message)
Message-ID: <795@mmintl.UUCP>
Date: Sat, 16-Nov-85 01:19:41 EST
Article-I.D.: mmintl.795
Posted: Sat Nov 16 01:19:41 1985
Date-Received: Wed, 20-Nov-85 01:08:01 EST
References: <471@harvard.ARPA> <773@mmintl.UUCP> <968@enea.UUCP> <501@harvard.ARPA>
Reply-To: franka@mmintl.UUCP (Frank Adams)
Distribution: net
Organization: Multimate International, E. Hartford, CT
Lines: 45


In article <501@harvard.ARPA> kosower@harvard.ARPA writes:
>Most native speakers will probably hyphenate
>at least a fair percentage of words by... looking them up in
>a printed dictionary.

Actually, I think most native speakers will put a hyphen in a place
where they are reasonably sure one belongs, and will achieve a rather
high success rate at doing so.  I do agree that fully interactive
hyphenation is unacceptable.  However, a reasonably sized dictionary,
falling back on interaction rather than on an algorithm, seems to me
to be a viable option in many cases.  From experience, I would say
that most words not found in a 30,000 or so word dictionary are proper
nouns, and not likely to be found even in a much larger dictionary.
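
To make the dictionary-plus-interaction scheme concrete, here is a rough
sketch in Python.  Everything in it is an assumption made for
illustration: the names hyphenate, show and ask are made up, break points
are stored as character offsets, and a real system would load its
30,000 or so words from a file rather than a literal.

```python
def hyphenate(word, dictionary, ask):
    """Return the break offsets for word.

    The dictionary is consulted first; only on a miss is the (hypothetical)
    ask callback invoked, standing in for a prompt to the user.
    """
    key = word.lower()
    if key not in dictionary:
        # Miss: ask once, then remember the answer so that the same
        # proper noun never triggers a second question.
        dictionary[key] = ask(word)
    return dictionary[key]


def show(word, points):
    """Render word with hyphens inserted at the given offsets."""
    pieces, prev = [], 0
    for p in sorted(points):
        pieces.append(word[prev:p])
        prev = p
    pieces.append(word[prev:])
    return "-".join(pieces)


# Tiny stand-in dictionary; offsets mark where a hyphen may go.
words = {"hyphenation": [2, 6], "algorithm": [2, 4]}

# A canned answer replaces the real interactive prompt in this demo.
print(show("hyphenation", hyphenate("hyphenation", words, lambda w: [])))
```

Because each answer is stored, the amount of interaction is bounded by
the number of distinct unknown words, not by the number of line breaks.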

>In fact, there are three significant numbers about any hyphenation
>mechanism ("mechanism" here includes dictionary lookup):
>
>   o  The percentage of incorrect hyphenations it produces.
>
>   o  The percentage of all possible hyphenations that it actually
>      finds.
>
>   o  Its efficiency.
>
>Both of the first two numbers should of course be measured for realistic
>text samples, i.e. they should be weighted for REALISTIC frequencies
>of word appearances.  We want the first number to be as close to
>zero as possible, and the second number to be as close to 100%
>as possible.  But while we would probably not tolerate a percentage
>of incorrect hyphens greater than about 5% (remember that hyphenation
>isn't all that frequent in most documents, so this already amounts
>to a rather infrequent error), we might well tolerate an algorithm
>that produces significantly less than 100% of all possible hyphens,
>especially if the hyphens it does find break the word up into
>small enough chunks; I would estimate that a figure as low as 70 to
>80% might be acceptable here.

I would quibble with these figures.  I think you want the first number
under 1% for a general purpose algorithm.  On the other hand, I think
even 50% is quite adequate for the second.  Since the Knuth-Liang
algorithm [description in original article not quoted here] apparently
meets these criteria, I will withdraw my claim.
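
For anyone who wants to measure the first two numbers on their own text,
a minimal sketch in Python follows.  It assumes a reference list of
correct break offsets per word and a table of word counts from a
realistic sample; the function name score and the toy data are invented
for illustration and are not measurements of any real algorithm.

```python
def score(mechanism, reference, counts):
    """Return (error_rate, coverage), both weighted by word frequency.

    mechanism: callable mapping a word to the break offsets it proposes
    reference: dict mapping a word to its correct break offsets
    counts:    dict mapping a word to its number of occurrences in the text
    """
    produced = wrong = possible = found = 0
    for word, n in counts.items():
        truth = set(reference[word])
        guess = set(mechanism(word))
        produced += n * len(guess)          # hyphens the mechanism emits
        wrong += n * len(guess - truth)     # emitted but incorrect
        possible += n * len(truth)          # hyphens that exist at all
        found += n * len(guess & truth)     # correct ones actually found
    error_rate = wrong / produced if produced else 0.0
    coverage = found / possible if possible else 0.0
    return error_rate, coverage


# Toy data: hy-phen-ation and al-go-rithm, with made-up counts.
reference = {"hyphenation": [2, 6], "algorithm": [2, 4]}
counts = {"hyphenation": 3, "algorithm": 1}
guesses = {"hyphenation": [2], "algorithm": [2, 5]}

error_rate, coverage = score(lambda w: guesses[w], reference, counts)
print(error_rate, coverage)
```

The first number is taken as a fraction of the hyphens produced and the
second as a fraction of the hyphens possible, so a mechanism that finds
few hyphens is not thereby credited with a low error rate.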

Frank Adams                           ihpn4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108