Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!csd4.milw.wisc.edu!cs.utexas.edu!uunet!mcvax!hp4nl!botter!star.cs.vu.nl!ast@cs.vu.nl From: ast@cs.vu.nl (Andy Tanenbaum) Newsgroups: comp.os.minix Subject: compression Message-ID: <2888@ast.cs.vu.nl> Date: 15 Jul 89 22:47:54 GMT Sender: ast@cs.vu.nl Reply-To: ast@cs.vu.nl (Andy Tanenbaum) Organization: VU Informatica, Amsterdam Lines: 21 I have gone through the net postings and found a number of useful programs. I have this fear that the next distribution is going to take 20 disks and cost $149. I have also noticed that compress only seems to win a factor of 2. I wonder if better compression of C programs is possible. In particular, suppose we had a two pass compression program. Pass 1 collected all the strings in the program and their frequencies. Pass 2 replaced the most important string by ASCII code 128, the next most important one by 129 etc, sort of like libpack.c does, only dynamically instead of using fixed strings. Most important means that the product of # occurences times length is maximum. The compressed file would include regular ASCII characters 0-127, and 120 or so characters denoting strings. The last 8 could be for 1 tab, 2 tabs, 3 tabs, etc., and a two byte sequence for encoding runs of blanks. The dictionary would have to go too. One would also have to think carefully about what a string is, i.e., is "while" a string, or "while (" a better choice? It is my suspicion that such a program could compress better than a factor of 2 on C programs. Does anyone know of such a program, or want to write one? Andy Tanenbaum (ast@cs.vu.nl)