Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!sdd.hp.com!uakari.primate.wisc.edu!aplcen!jhunix!hsu_wh
From: hsu_wh@jhunix.HCF.JHU.EDU (William H Hsu)
Newsgroups: comp.compression
Subject: Analyzing text files
Keywords: text compression
Message-ID: <8839@jhunix.HCF.JHU.EDU>
Date: 27 Jun 91 15:28:09 GMT
Organization: The Johns Hopkins University - HCF
Lines: 17

Could someone point me in the direction of some code for fast analysis of
text files? I am looking for C source to do this, or bibliographic sources
which discuss it. I know there must be a lot of code out there, because
last year I saw 5 or more posted requests for 1 meg+ test file samples for
analysis.

What I am trying to get is code which will scan a text file and determine
in minimal time whether it is:

  - normal English (or other Roman alphabetic text, e.g., French without
    non-ASCII characters),
  - a converted binary file (e.g., BinHex'ed, uuencoded),
  - ANSI "graphics", or
  - source code (if this is sufficiently different to be distinguishable
    from English in a relatively short amount of time).

I understand that there is probably a significant performance (accuracy)
tradeoff as file size decreases, so for purposes of convenience, perhaps it
can be assumed that only files above 1 or 2K are analyzed.

Does such code exist, and if so, where can one obtain it? And what is the
fastest implementation?