Xref: utzoo sci.lang:2214 sci.crypt:1036
Path: utzoo!mnetor!uunet!husc6!uwvax!heurikon!lampman
From: lampman@heurikon.UUCP (Ray Lampman)
Newsgroups: sci.lang,sci.crypt
Subject: Re: simple language statistics
Message-ID: <201@heurikon.UUCP>
Date: 25 Apr 88 21:06:11 GMT
References: <9141@agate.BERKELEY.EDU>
Reply-To: lampman@heurikon.UUCP (Ray Lampman)
Followup-To: sci.lang
Organization: Heurikon Corp., Madison WI
Lines: 22

__________________________________________________________________________

In article <9141@agate.BERKELEY.EDU> doug@eris.UUCP (Doug Merritt) writes:
| I've written a program that categorizes files by the apparent language
| they're written in.
__________________________________________________________________________

I will be interested in this tool when you are satisfied it is complete.
Please send mail or post news; others may be interested as well.

Given enough sample text from which to compute letter, digram, or trigram
frequencies, it should be possible to identify just about any language.
Will your program recognize computer languages as well as human ones?
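
To make the frequency idea concrete, here is a rough sketch of how I imagine
such a comparison could work. It is not a description of Doug's program,
which I have not seen. The function names (trigram_profile, cosine,
identify) and the choice of cosine similarity as the comparison measure are
my own assumptions; other distance measures would do as well.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Return the relative frequency of each letter trigram in text."""
    # Assumption: lowercase everything, treat non-letters as spaces,
    # and collapse runs of whitespace to a single space.
    letters = "".join(c.lower() if c.isalpha() else " " for c in text)
    cleaned = " ".join(letters.split())
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: n / total for gram, n in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def identify(text, profiles):
    """Return the name of the known profile most similar to text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda name: cosine(sample, profiles[name]))
```

Here `profiles` is assumed to be a dictionary mapping a language name to the
profile built from a sample of that language, so adding a language amounts
to storing trigram_profile(sample) under a new name. Real use would need far
more sample text than a sentence or two per language.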

How about providing a way of `teaching' your program about languages it
does not yet recognize? If I can provide a sample text of my favorite
language `X', can your program assimilate the sample and recognize other
samples of the same language? What should the program do if there is no
statistically significant difference between a language it already `knows'
and a new one you are trying to teach it? Hope some of this is useful,
-- 
                                        - Ray Lampman (lampman@heurikon.UUCP)