Xref: utzoo sci.lang:2214 sci.crypt:1036
Path: utzoo!mnetor!uunet!husc6!uwvax!heurikon!lampman
From: lampman@heurikon.UUCP (Ray Lampman)
Newsgroups: sci.lang,sci.crypt
Subject: Re: simple language statistics
Message-ID: <201@heurikon.UUCP>
Date: 25 Apr 88 21:06:11 GMT
References: <9141@agate.BERKELEY.EDU>
Reply-To: lampman@heurikon.UUCP (Ray Lampman)
Followup-To: sci.lang
Organization: Heurikon Corp., Madison WI
Lines: 22

__________________________________________________________________________
In article <9141@agate.BERKELEY.EDU> doug@eris.UUCP (Doug Merritt) writes:
| I've written a program that categorizes files by the apparent language
| they're written in.
__________________________________________________________________________

I will be interested in this tool when you are satisfied it is complete.
Please send mail or post news, others may be interested as well.

Given enough example texts for letter, digram, or trigram frequencies,
it should be possible to identify just about any language. Will your
program recognize computer as well as human languages?

How about providing a way of `teaching' your program about languages it
does not yet recognize? If I can provide a sample text of my favorite
language `X', can your program assimilate the sample and recognize other
samples of the same language? What should the program do if there is no
statistical difference between a language it already `knows' and a new
one you are trying to teach it?

Hope some of this is useful,
--
- Ray Lampman (lampman@heurikon.UUCP)
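
The trigram-frequency idea and the `teaching' question above can be sketched roughly as follows. This is not Doug Merritt's program, whose internals are not described here; it is a minimal illustration in Python. It assumes character n-gram profiles compared by cosine similarity. The names `ngram_profile`, `cosine_similarity`, `LanguageGuesser`, and the `min_margin` threshold are all hypothetical. When two taught languages score within `min_margin` of each other, the sketch declines to answer, which is one possible response to the "no statistical difference" case.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams in text."""
    # Lowercase and collapse whitespace so layout does not affect counts.
    cleaned = " ".join(text.lower().split())
    counts = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


class LanguageGuesser:
    """Hypothetical teachable classifier; an assumption, not the original tool."""

    def __init__(self, n=3, min_margin=0.05):
        self.n = n                    # 1 = letters, 2 = digrams, 3 = trigrams
        self.min_margin = min_margin  # assumed threshold for "too close to call"
        self.profiles = {}

    def teach(self, name, sample):
        """Store (or overwrite) the profile for language `name`."""
        self.profiles[name] = ngram_profile(sample, self.n)

    def rank(self, text):
        """Return (score, name) pairs, best match first."""
        probe = ngram_profile(text, self.n)
        scores = [(cosine_similarity(probe, profile), name)
                  for name, profile in self.profiles.items()]
        return sorted(scores, reverse=True)

    def guess(self, text):
        """Return the best-matching name, or None if nothing matches
        or the top two candidates are statistically indistinguishable."""
        ranked = self.rank(text)
        if not ranked or ranked[0][0] == 0.0:
            return None
        if len(ranked) > 1 and ranked[0][0] - ranked[1][0] < self.min_margin:
            return None
        return ranked[0][1]
```

To try it, call `teach("X", sample_text)` once per language and then `guess(other_text)`. Nothing in the sketch is specific to human languages, so a sample of C or Pascal source could be taught the same way. Whether the n-gram statistics separate such samples well in practice is an open question that this sketch does not settle.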