Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!rutgers!rochester!PT.CS.CMU.EDU!SPEECH2.CS.CMU.EDU!kfl
From: kfl@SPEECH2.CS.CMU.EDU (Kai-Fu Lee)
Newsgroups: comp.ai,sci.lang
Subject: Re: Measures of "Englishness"?
Message-ID: <364@PT.CS.CMU.EDU>
Date: Tue, 17-Nov-87 12:48:04 EST
Article-I.D.: PT.364
Posted: Tue Nov 17 12:48:04 1987
Date-Received: Thu, 19-Nov-87 20:49:26 EST
References: <32fordjm@byuvax.bitnet>
Sender: netnews@PT.CS.CMU.EDU
Organization: Carnegie-Mellon University, CS/RI
Lines: 42
Xref: mnetor comp.ai:1131 sci.lang:1694

In article <32fordjm@byuvax.bitnet>, fordjm@byuvax.bitnet writes:
> 
>   Recently someone on the net commented on a program or method of rating
> the "Englishness" of words according to the frequency of occurrence of
> various letters in sequence, etc.
>      

I don't know anything about that post, but you might be interested
in the following article:
	Cave and Neuwirth, Hidden Markov Models for English, Proceedings
	of the Symposium on Application of Hidden Markov Models to Text
	and Speech, Princeton, NJ, 1980.

Here's the editor's summary of the paper:

	L.P. Neuwirth discusses the application of hidden Markov analysis to
	English newspaper text (26 letters plus word space, without 
	punctuation).  This work showed that the technique is capable 
	of automatically discovering linguistically important categorizations
	(e.g., vowels and consonants).  Moreover, a calculation of the
	entropy of these models shows that some of them are stronger than
	the ordinary digraphic model, yet employ only half as many parameters.
	But one of the most interesting points, from a philosophical point
	of view, is the completely automatic nature of the process of
	obtaining the model: only the size of the state space, and a
	long example of English text, are given.  No a priori structure of the
	state transition matrix, or of the output probabilities is assumed.

Since hidden Markov models can be used for both generation and
recognition, it is possible to train a model on English text and then
"score" any previously unseen word with the probability that the
trained model would generate it.
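
To make the scoring step concrete, here is a minimal sketch in Python of
the forward algorithm, which computes the probability that a given HMM
generates a word.  Everything specific in it is my own assumption and
not from the paper: the two states (one favouring vowels, one favouring
consonants and word space), the 9-to-1 emission weights, and the start
and transition numbers are hand-picked for illustration.  A real model
would have its parameters estimated from a long sample of English text
(e.g., with Baum-Welch re-estimation), which is not shown here.

```python
import math

# 26 letters plus word space, as in the summary above.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
VOWELS = set("aeiou")


def make_emissions():
    """Build hypothetical emission tables for two states.

    State 0 favours vowels; state 1 favours consonants and space.
    The 9:1 weighting is an arbitrary illustrative choice.
    """
    tables = []
    for favours_vowels in (True, False):
        weights = [9.0 if (c in VOWELS) == favours_vowels else 1.0
                   for c in ALPHABET]
        total = sum(weights)
        tables.append({c: w / total for c, w in zip(ALPHABET, weights)})
    return tables


# Hypothetical, hand-set parameters (NOT trained on any text).
START = [0.3, 0.7]          # initial state probabilities
TRANS = [[0.2, 0.8],        # from the vowel-like state
         [0.6, 0.4]]        # from the consonant-like state
EMIT = make_emissions()
N_STATES = len(START)


def log_likelihood(word):
    """Return log P(word | model) using the scaled forward algorithm.

    Raises KeyError for characters outside ALPHABET.
    """
    word = word.lower()
    if not word:
        raise ValueError("word must be non-empty")
    alpha = [START[s] * EMIT[s][word[0]] for s in range(N_STATES)]
    log_p = 0.0
    for ch in word[1:]:
        # Rescale to avoid underflow, accumulating the log of the scale.
        scale = sum(alpha)
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
        alpha = [sum(alpha[i] * TRANS[i][j] for i in range(N_STATES))
                 * EMIT[j][ch]
                 for j in range(N_STATES)]
    return log_p + math.log(sum(alpha))


def englishness(word):
    """Per-letter log-likelihood, so long words are not penalized
    merely for being long."""
    return log_likelihood(word) / len(word)


if __name__ == "__main__":
    for w in ("banana", "bnnnaa"):
        print(w, round(englishness(w), 3))
```

With these made-up parameters, a word that alternates consonants and
vowels ("banana") scores higher than the same letters rearranged into
clusters ("bnnnaa").  With trained parameters and more states, the same
routine would give the kind of "Englishness" score asked about.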

> Thanks in advance,
> John M. Ford               fordjm@byuvax.bitnet
> 131 Starcrest Drive
> Orem, UT 84058
> 

Kai-Fu Lee
Computer Science Department
Carnegie-Mellon University