Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!usc!elroy.jpl.nasa.gov!ames!lll-winken!uunet!philmtl!philabs!fnh From: fnh@coyote.philips.com (Fletcher Holmquist;6483;4.85;$0351) Newsgroups: comp.os.minix Subject: Re: compression Message-ID: Date: 20 Jul 89 14:30:24 GMT References: <2888@ast.cs.vu.nl> <1989Jul18.174647.19537@utzoo.uucp> <2908@ast.cs.vu.nl> <576@prles2.UUCP> Sender: news@philabs.Philips.Com Organization: Philips Laboratories - Briarcliff Lines: 25 In-reply-to: meulenbr@cstw01.prl.philips.nl's message of 19 Jul 89 13:00:52 GMT I have done some preliminary tests along the lines of ast's suggestions. By replacing recurring strings by a three character token, I reduced much of the GNU C source by about 36%. The resulting file is printable ASCII text, and compress can is used as a follow-up with good results. The token replacement seems to be quite worthwhile. My preferred solution is a Huffman encoding of sequences, with slots for keywords, identifiers, macro names, and common C sequences (", ", "{\n", "/*", }. This could be a static list to make make things quick. There would also be slots for the tokens as used above for recurring strings not already recognized. The process has two passes, one to hit the strings and one to produce a Huffman code. The savings for C code with <20% comments should be 60-80%. I am writing an initial version now, and will mail things to ast when done. If others are working the problem, please send me mail to see if we can combine the good points, or at least argue. fletch (fnh@philabs.philips.com) -- fletch holmquist robotics and flexible automation fnh@philabs.philips.com philips laboratories (914) 945-6483 345 scarborough road briarcliff manor, ny, 10510