Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!sdd.hp.com!uakari.primate.wisc.edu!aplcen!jhunix!hsu_wh
From: hsu_wh@jhunix.HCF.JHU.EDU (William H Hsu)
Newsgroups: comp.compression
Subject: Analyzing text files
Keywords: text compression
Message-ID: <8839@jhunix.HCF.JHU.EDU>
Date: 27 Jun 91 15:28:09 GMT
Organization: The Johns Hopkins University - HCF
Lines: 17


	Could someone point me in the direction of some code for fast
analysis of text files?  I am looking for C source to do this, or
bibliographic sources which discuss it.  I know there must be a lot of code
out there, because last year I saw 5 or more posted requests for 1 meg+ test
file samples for analysis.
	What I am trying to get is code which will scan a text file and
determine in minimal time whether it is normal English (or other Roman
alphabetic text, e.g., French w/out non-ASCII characters), or a converted
binary file (e.g., BinHex'ed, uuencoded), or ANSI "graphics", or source code
(if this is sufficiently different to be distinguishable from English in a
relatively short amount of time).
	I understand that there is probably a significant performance
(accuracy) tradeoff as file size decreases, so for purposes of convenience,
perhaps it can be assumed that only files above 1 or 2K are analyzed.
	Does such code exist, and if so: where can one obtain it?  And what
is the fastest implementation?