Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!purdue!gatech!mcnc!rti!xyzzy!throopw@dg-rtp.dg.com
From: throopw@dg-rtp.dg.com
Newsgroups: comp.os.minix
Subject: Re: Compress
Message-ID: <8067@xyzzy.UUCP>
Date: 23 Jul 89 03:37:01 GMT
References: <4642@crash.cts.com>
Sender: usenet@xyzzy.UUCP
Lines: 40

> jdeitch@pnet01.cts.com (Jim Deitch)
>> ast@cs.vu.nl (Andy Tanenbaum)
>>suppose we had a two pass compression program.  Pass 1 collected all the
>>strings in the program and their frequencies.  Pass 2 replaced the most
>>important string by ASCII code 128, the next most important one by 129 etc,
>>[...] [..also, what is a string..], i.e., is "while" a string, or "while ("
>>a better choice?  It is my suspicion that such a program could compress
>>better than a factor of 2 on C programs.

I wrote such a program for just such a purpose.  However, I abandoned it
when it failed to outperform 16-bit modified Lempel-Ziv compression
(as embodied in the the widely distributed "compress" program), at least
for the cases I tried it on.

I also fooled around with quite a bit of choices of how to break the
input stream into "words" or "tokens".  The most effective seemed to be
to use alphabetic to non-alphabetic (or vice versa) transitions for the
token boundaries.  Now, I was trying to compress a news stream, and my
compression scheme came *close* to 16-bit compression, hovering around
1.8-to-1.  It might do better on C programs as opposed to English text
interspersed with headers.  After all, C programs have a smaller and
more repetitive vocabulary than does standard English (or even Net-ese). 

If anybody is interested, I can send my prototype code along to them
(though I'm not terribly proud of it...  it was just an afternoon's
experiment in hopes of getting 3 or 5 -to-1 news compression that didn't
work out). 


> Why not use a 16 bit compression?  I have a version that will compile on the
> XT, AT under dos, and supports 16 bit.  Also, has anyone tried to port the
> Atari 16 bit that went through the net awhile ago to the IBM?

I presume the reason is that it won't fit into a minix address space on
the versions of minix for IBM-PC descended machines.  However, as I
wrote my version of token-based compression, it shared the Lempel-Ziv
algorithm's taste for a large working set (I stored the "dictionary" in
memory).  This could be improved upon I suppose.

--
Wayne Throop      <the-known-world>!mcnc!rti!xyzzy!throopw