Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!rutgers!rochester!PT.CS.CMU.EDU!SPEECH2.CS.CMU.EDU!kfl
From: kfl@SPEECH2.CS.CMU.EDU (Kai-Fu Lee)
Newsgroups: comp.ai,sci.lang
Subject: Re: Measures of "Englishness"?
Message-ID: <364@PT.CS.CMU.EDU>
Date: Tue, 17-Nov-87 12:48:04 EST
Article-I.D.: PT.364
Posted: Tue Nov 17 12:48:04 1987
Date-Received: Thu, 19-Nov-87 20:49:26 EST
References: <32fordjm@byuvax.bitnet>
Sender: netnews@PT.CS.CMU.EDU
Organization: Carnegie-Mellon University, CS/RI
Lines: 42
Xref: mnetor comp.ai:1131 sci.lang:1694

In article <32fordjm@byuvax.bitnet>, fordjm@byuvax.bitnet writes:
>
> Recently someone on the net commented on a program or method of rating
> the "Englishness" of words according to the frequency of occurrence of
> various letters in sequence, etc.
>

I don't know anything about that post, but you might be interested in
the following article:

    Cave and Neuwirth, Hidden Markov Models for English, Proceedings
    of the Symposium on Application of Hidden Markov Models to Text
    and Speech, Princeton, NJ 1980.

Here's the editor's summary of the paper:

    L.P. Neuwirth discusses the application of hidden Markov analysis
    to English newspaper text (26 letters plus word space, without
    punctuation). This work showed that the technique is capable of
    automatically discovering linguistically important categorizations
    (e.g., vowels and consonants). Moreover, a calculation of the
    entropy of these models shows that some of them are stronger than
    the ordinary digraphic model, yet employ only half as many
    parameters. But one of the most interesting points, from a
    philosophical point of view, is the completely automatic nature of
    the process of obtaining the model: only the size of the state
    space, and a long example of English text, are given. No a priori
    structure of the state transition matrix, or of the output
    probabilities is assumed.
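
To make the "ordinary digraphic model" baseline in that summary a bit
more concrete, here is a rough Python sketch of my own (it is not from
the paper). It estimates the conditional entropy of a letter-pair model
over the 27-symbol alphabet and compares table sizes. The function
names are made up for illustration, and "parameters" is taken to mean
raw table entries, which is my assumption about how they were counted.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus word space

def normalize(text):
    """Lowercase; turn anything that is not a-z into a single word space."""
    chars = [c if c.isascii() and c.isalpha() else " " for c in text.lower()]
    return " ".join("".join(chars).split())

def digraph_entropy(text):
    """Estimate H(next symbol | previous symbol) in bits from raw pair counts."""
    s = normalize(text)
    pairs = Counter(zip(s, s[1:]))   # counts of adjacent symbol pairs
    firsts = Counter(s[:-1])         # how often each symbol starts a pair
    total = sum(pairs.values())
    return -sum(n / total * math.log2(n / firsts[a])
                for (a, _), n in pairs.items())

def digraph_params():
    """Table entries in a digraphic model: one per ordered symbol pair."""
    return len(ALPHABET) ** 2

def hmm_params(n_states):
    """Table entries in an HMM: transition matrix plus output probabilities."""
    return n_states ** 2 + n_states * len(ALPHABET)
```

With these definitions digraph_params() is 729, while a hypothetical
10-state model has hmm_params(10) = 370 entries, which is about half.
The state counts actually used in the paper are not given in the
summary, so take the 10 as an example only.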

Since hidden Markov models can be used for both generation and
recognition, it is possible to train a model on English text and then
"score" any previously unseen word with the probability that it was
generated by the English model.

> Thanks in advance,
> John M. Ford          fordjm@byuvax.bitnet
> 131 Starcrest Drive
> Orem, UT 84058
>

Kai-Fu Lee
Computer Science Department
Carnegie-Mellon University
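
P.S. For anyone who wants to try the scoring idea, here is a minimal
Python sketch of the forward algorithm, which computes the probability
that a given HMM generated a word. The two-state model and all of its
numbers are invented by me for illustration (one state favours vowels,
the other consonants and space). A real model would be trained from
text, for example with Baum-Welch reestimation, which is not shown.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus word space
VOWELS = set("aeiou")

def toy_model():
    """Hand-set 2-state HMM; every number here is hypothetical, not trained."""
    def emissions(favour_vowels):
        # Weight 9 for the favoured class of symbols, 1 for the rest.
        w = [9.0 if (c in VOWELS) == favour_vowels else 1.0 for c in ALPHABET]
        z = sum(w)
        return [x / z for x in w]
    start = [0.3, 0.7]                    # P(first state)
    trans = [[0.2, 0.8], [0.6, 0.4]]      # trans[i][j] = P(next j | current i)
    emit = [emissions(True), emissions(False)]
    return start, trans, emit

def log_score(word, model):
    """Natural-log probability of `word` under the HMM (scaled forward pass).

    Raises ValueError if the word contains a symbol outside ALPHABET.
    """
    start, trans, emit = model
    obs = [ALPHABET.index(c) for c in word.lower()]
    if not obs:
        return 0.0                        # the empty word has probability 1
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    logp = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n))
                     * emit[j][obs[t]] for j in range(n)]
        scale = sum(alpha)                # rescale to avoid underflow
        logp += math.log(scale)
        alpha = [a / scale for a in alpha]
    return logp
```

Under this toy model log_score("banana", toy_model()) comes out higher
than log_score("bnnnaa", toy_model()), since the model prefers
alternating vowels and consonants. Longer words always get lower
probabilities, so when comparing words of different lengths it is
fairer to divide the log score by the word length.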