Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site mmintl.UUCP
Path: utzoo!watmath!clyde!bonnie!akgua!gatech!seismo!cmcl2!philabs!pwa-b!mmintl!franka
From: franka@mmintl.UUCP (Frank Adams)
Newsgroups: net.internat
Subject: Re: Hyphenation (Long message)
Message-ID: <795@mmintl.UUCP>
Date: Sat, 16-Nov-85 01:19:41 EST
Article-I.D.: mmintl.795
Posted: Sat Nov 16 01:19:41 1985
Date-Received: Wed, 20-Nov-85 01:08:01 EST
References: <471@harvard.ARPA> <773@mmintl.UUCP> <968@enea.UUCP> <501@harvard.ARPA>
Reply-To: franka@mmintl.UUCP (Frank Adams)
Distribution: net
Organization: Multimate International, E. Hartford, CT
Lines: 45

In article <501@harvard.ARPA> kosower@harvard.ARPA writes:
>Most native speakers will probably hyphenate
>at least a fair percentage of words by... looking them up in
>a printed dictionary.

Actually, I think most native speakers will put a hyphen in a place
where they are reasonably sure one belongs, and will achieve a rather
high success rate at doing so.

I do agree that fully interactive hyphenation is unacceptable.  However,
a reasonably sized dictionary, with resort to interaction instead of to
an algorithm when a word is not found, seems to me to be a viable option
in many cases.  From experience, I would say that most words not found
in a dictionary of 30,000 or so words are proper nouns, and not likely
to be found even in a much larger dictionary.

>In fact, there are three significant numbers about any hyphenation
>mechanism ("mechanism" here includes dictionary lookup):
>
>   o The percentage of incorrect hyphenations it produces.
>
>   o The percentage of all possible hyphenations that it actually
>     finds.
>
>   o Its efficiency.
>
>Both of the first two numbers should of course be measured for realistic
>text samples, i.e. they should be weighted for REALISTIC frequencies
>of word appearances.  We want the first number to be as close to
>zero as possible, and the second number to be as close to 100%
>as possible.  But while we would probably not tolerate a percentage
>of incorrect hyphens greater than about 5% (remember that hyphenation
>isn't all that frequent in most documents, so this already amounts
>to a rather infrequent error), we might well tolerate an algorithm
>that produces significantly less than 100% of all possible hyphens,
>especially if the hyphens it does find break the word up into
>small enough chunks; I would estimate that a figure as low as 70 to
>80% might be acceptable here.

I would quibble with these figures.  I think you want the first number
under 1% for a general purpose algorithm.  On the other hand, I think
even 50% is quite adequate for the second.

Since the Knuth-Liang algorithm [description in original article not
quoted here] apparently meets these criteria, I will withdraw my claim.

Frank Adams                           ihpn4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108
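
For concreteness, here is a rough sketch of how the first two numbers
discussed above (percentage of incorrect hyphens, percentage of possible
hyphens found) might be measured with frequency weighting.  Everything
in it is an assumption made for illustration: the function name `score`,
the representation of break points as character offsets, the reading of
"incorrect" as "proposed but not in the reference", and the toy word
list, whose break points and frequencies are invented and not taken from
any dictionary or from the Knuth-Liang patterns.

```python
def score(hyphenate, reference, freq):
    """Return (error_rate, found_rate), weighted by word frequency.

    hyphenate: function mapping a word to a set of proposed break offsets
    reference: dict mapping a word to the set of acceptable break offsets
    freq:      dict mapping a word to how often it occurs in the sample
    """
    proposed = wrong = found = possible = 0
    for word, count in freq.items():
        truth = reference[word]
        guess = hyphenate(word)
        proposed += count * len(guess)          # hyphens the mechanism offers
        wrong += count * len(guess - truth)     # offered but not acceptable
        found += count * len(guess & truth)     # offered and acceptable
        possible += count * len(truth)          # all acceptable hyphens
    error_rate = wrong / proposed if proposed else 0.0
    found_rate = found / possible if possible else 0.0
    return error_rate, found_rate


# Hypothetical toy data: offsets count characters before the break.
reference = {
    "hyphenation": {2, 6, 7},
    "dictionary": {3, 7},
    "algorithm": {2, 4},
}
freq = {"hyphenation": 1, "dictionary": 3, "algorithm": 6}

# Stand-in for a real mechanism: a fixed table of its answers.
guesses = {
    "hyphenation": {2, 6},
    "dictionary": {3, 5},   # offset 5 is a deliberate wrong break
    "algorithm": {2},
}

error_rate, found_rate = score(lambda w: guesses[w], reference, freq)
print(f"incorrect: {error_rate:.1%}  found: {found_rate:.1%}")
```

Weighting by `count` is what makes a mistake in a common word cost more
than the same mistake in a rare one, which is the point of measuring on
realistic text rather than on a flat word list.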