Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!purdue!gatech!mcnc!rti!xyzzy!throopw@dg-rtp.dg.com From: throopw@dg-rtp.dg.com Newsgroups: comp.os.minix Subject: Re: Compress Message-ID: <8067@xyzzy.UUCP> Date: 23 Jul 89 03:37:01 GMT References: <4642@crash.cts.com> Sender: usenet@xyzzy.UUCP Lines: 40 > jdeitch@pnet01.cts.com (Jim Deitch) >> ast@cs.vu.nl (Andy Tanenbaum) >>suppose we had a two pass compression program. Pass 1 collected all the >>strings in the program and their frequencies. Pass 2 replaced the most >>important string by ASCII code 128, the next most important one by 129 etc, >>[...] [..also, what is a string..], i.e., is "while" a string, or "while (" >>a better choice? It is my suspicion that such a program could compress >>better than a factor of 2 on C programs. I wrote such a program for just such a purpose. However, I abandoned it when it failed to outperform 16-bit modified Lempel-Ziv compression (as embodied in the the widely distributed "compress" program), at least for the cases I tried it on. I also fooled around with quite a bit of choices of how to break the input stream into "words" or "tokens". The most effective seemed to be to use alphabetic to non-alphabetic (or vice versa) transitions for the token boundaries. Now, I was trying to compress a news stream, and my compression scheme came *close* to 16-bit compression, hovering around 1.8-to-1. It might do better on C programs as opposed to English text interspersed with headers. After all, C programs have a smaller and more repetitive vocabulary than does standard English (or even Net-ese). If anybody is interested, I can send my prototype code along to them (though I'm not terribly proud of it... it was just an afternoon's experiment in hopes of getting 3 or 5 -to-1 news compression that didn't work out). > Why not use a 16 bit compression? I have a version that will compile on the > XT, AT under dos, and supports 16 bit. Also, has anyone tried to port the > Atari 16 bit that went through the net awhile ago to the IBM? I presume the reason is that it won't fit into a minix address space on the versions of minix for IBM-PC descended machines. However, as I wrote my version of token-based compression, it shared the Lempel-Ziv algorithm's taste for a large working set (I stored the "dictionary" in memory). This could be improved upon I suppose. -- Wayne Throop !mcnc!rti!xyzzy!throopw