Xref: utzoo sci.lang:2203 sci.crypt:1034
Path: utzoo!mnetor!uunet!lll-winken!lll-lcc!ames!umd5!purdue!decwrl!ucbvax!agate!eris!doug
From: doug@eris (Doug Merritt)
Newsgroups: sci.lang,sci.crypt
Subject: simple language statistics
Message-ID: <9141@agate.BERKELEY.EDU>
Date: 23 Apr 88 19:19:03 GMT
Sender: usenet@agate.BERKELEY.EDU
Reply-To: doug@eris.UUCP (Doug Merritt)
Organization: University of California, Berkeley
Lines: 39

I've written a program that categorizes files by the apparent language
they're written in. So far it distinguishes English from non-English
(e.g. scripts, software source, etc.) simply by letter frequency, as
derived from a simple statistical analysis I did. Some other languages are
recognized by the presence or absence of non-Latin alphabetical characters
encoded in 8-bit ANSI (e.g. hexadecimal F1 represents an 'n' with a
diacritical mark (a tilde), which I presumptuously assume means the file
is Spanish).

I'd like to do a better job of recognition. In particular I'd like to be
able to recognize common languages transcribed in the ASCII/ANSI Latin
alphabet. To do this I need at least a letter frequency table for various
languages (especially the Romance languages), and possibly a digram
frequency table if it's absolutely necessary for accuracy. Similar
information for 8-bit ANSI would also be useful (my system does not
support multi-byte ANSI so that's not an issue; it *does* support 8-bit
Icelandic, for instance, so I'm currently recognizing that, among others).

Does anyone have such information online that they could mail me?
Embedded in other software is fine, I can extract it. References to
printed material are also welcome; *especially* if there's a single source
I could use to find the most frequent letters (or digrams) for, say,
Spanish, German, French, Italian, Japanese, or any other languages that
are frequently transcribed into the Latin alphabet.
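
To make the request concrete, here is a rough sketch of the kind of
letter-frequency comparison described above. It is written in Python purely
for illustration and is not the actual program. Every name in it
(letter_profile, profile_distance, closest_language, has_byte, SAMPLES) is
hypothetical, the Euclidean distance is just one plausible way to compare
profiles, and the two one-line samples are toy placeholders standing in for
the real per-language frequency tables being asked for.

```python
from collections import Counter
import math


def letter_profile(text):
    """Relative frequency of each ASCII letter a-z in text (case-folded)."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two letter-frequency profiles."""
    letters = set(p) | set(q)
    return math.sqrt(sum((p.get(c, 0.0) - q.get(c, 0.0)) ** 2 for c in letters))


def closest_language(text, references):
    """Name of the reference profile nearest to the profile of text."""
    profile = letter_profile(text)
    return min(references, key=lambda name: profile_distance(profile, references[name]))


def has_byte(data, byte_value):
    """True if the raw bytes contain the given 8-bit code, e.g. 0xF1."""
    return byte_value in data


# Toy placeholder samples (assumption: real use needs a large corpus or a
# published frequency table per language, which is what this post asks for).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "spanish": "el perro duerme en la casa y el gato come pescado en la cocina",
}
REFERENCES = {name: letter_profile(text) for name, text in SAMPLES.items()}

if __name__ == "__main__":
    print(closest_language("the dog sleeps in the house", REFERENCES))
    print(has_byte(b"ma\xf1ana", 0xF1))
```

A digram table would slot into the same scheme by counting adjacent letter
pairs instead of single letters. The byte test is only as good as the
assumption behind it: finding F1 says the file uses that 8-bit code, not
that the text is necessarily Spanish.
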
The end product is a general purpose file identifier, like "file" on Unix
(but a lot smarter), so any other, possibly bizarre, but easily
recognizable languages not implied by the above description would also be
of potential interest.

Oh yeah, if anyone has ANSI standards documentation online that applies to
any of this, that'd be great too.

Thanks for any and all help!

    Doug Merritt    doug@mica.berkeley.edu (ucbvax!mica!doug)
            or      ucbvax!unisoft!certes!doug
            or      sun.com!cup.portal.com!doug-merritt