Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!csd4.milw.wisc.edu!cs.utexas.edu!uunet!mcvax!hp4nl!botter!star.cs.vu.nl!ast@cs.vu.nl
From: ast@cs.vu.nl (Andy Tanenbaum)
Newsgroups: comp.os.minix
Subject: compression
Message-ID: <2888@ast.cs.vu.nl>
Date: 15 Jul 89 22:47:54 GMT
Sender: ast@cs.vu.nl
Reply-To: ast@cs.vu.nl (Andy Tanenbaum)
Organization: VU Informatica, Amsterdam
Lines: 21


I have gone through the net postings and found a number of useful programs.
I have this fear that the next distribution is going to take 20 disks and cost
$149.  I have also noticed that compress only seems to win a factor of 2.
I wonder if better compression of C programs is possible.  In particular,
suppose we had a two pass compression program.  Pass 1 collected all the
strings in the program and their frequencies.  Pass 2 replaced the most
important string by ASCII code 128, the next most important one by 129 etc,
sort of like libpack.c does, only dynamically instead of using fixed strings.
Most important means that the product of # occurences times length is
maximum.  The compressed file would include regular ASCII characters 0-127,
and 120 or so characters denoting strings.  The last 8 could be for
1 tab, 2 tabs, 3 tabs, etc., and a two byte sequence for encoding runs of
blanks.  The dictionary would have to go too.  One would also have to think
carefully about what a string is, i.e., is "while" a string, or "while ("
a better choice?  It is my suspicion that such a program could compress
better than a factor of 2 on C programs.

Does anyone know of such a program, or want to write one?

Andy Tanenbaum (ast@cs.vu.nl)