Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!cbosgd!ihnp4!houxm!hjuxa!petsd!peora!jer
From: jer@peora.UUCP (J. Eric Roskos)
Newsgroups: net.mail,net.news
Subject: Re: Data compression to lower phone
Message-ID: <2197@peora.UUCP>
Date: Fri, 6-Jun-86 09:06:50 EDT
Article-I.D.: peora.2197
Posted: Fri Jun  6 09:06:50 1986
Date-Received: Sat, 7-Jun-86 14:27:39 EDT
References: <327@spdcc.UUCP> <8200002@nucsrl> <2369@phri.UUCP> <1015@k.cs.cmu.edu>
Organization: Concurrent Computer Corporation, Orlando, Fl
Lines: 38
Xref: watmath net.mail:1664 net.news:4965

> OK, so don't compress the whole article, just the body and headers that you
> don't have to look at or change.  The headers that you need can be put at
> the beginning and the compressed part can appear after a blank line.

Well, the only problem with that is that the compression scheme always
works on ordered pairs (p,n), where p is "something seen previously" and n
is "the next character seen".  The scheme doesn't have to carry a
translation table along with it, because the decompressor can rebuild the
same table from the codes as it reads them.  p can denote an arbitrarily
long sequence of characters, but this sequence has to be built up by
experience as the program sees these pairs.  So, for example, the
first time you see the sequence "abc", the compressed output would be a
code for "a", a code for "b", and a code for "c", but it would also
remember the pair ("a","b") and save a code for it in an internal table.
So the next time it saw "abc", it would generate the code for ("a","b"),
thus saving some bits over the separate codes for "a" and "b", and then
would output the code for "c", and would remember (("a","b"),"c") in its
table.  Then the next time it saw "abc" it would just output the code for
(("a","b"),"c").  So each time, the output would take fewer bits than
before for the same sequence.  (Codes are packed into the output file
without byte alignment, incidentally, so you don't have any "wasted" bits
when the code width isn't a multiple of 8, like 9 or 10 bits.)
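
To make the pair idea concrete, here is a minimal sketch of the textbook
LZW scheme in Python.  It is not the actual compress source: the function
names are made up for illustration, codes come back as plain integers
rather than packed bits, and there is no table reset.  The table in this
sketch also picks up pairs such as ("b","c"), so the exact codes differ a
little from the simplified walk-through above.

```python
def lzw_compress(data):
    """Return the LZW codes for a bytes object (as ints, not bit-packed)."""
    # Codes 0..255 stand for single bytes; new (p, n) pairs start at 256.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    codes = []
    p = b""                      # "something seen previously"
    for byte in data:
        n = bytes([byte])        # "the next character seen"
        if p + n in table:
            p = p + n            # keep extending the known sequence
        else:
            codes.append(table[p])
            table[p + n] = next_code   # remember the new pair
            next_code += 1
            p = n
    if p:
        codes.append(table[p])
    return codes


def lzw_decompress(codes):
    """Rebuild the original bytes, growing the same table from the codes."""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    if not codes:
        return b""
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the entry that is about to be created.
            entry = prev + prev[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[next_code] = prev + entry[:1]
        next_code += 1
        prev = entry
    return b"".join(out)


codes = lzw_compress(b"abcabcabc")   # [97, 98, 99, 256, 258, 257]
assert lzw_decompress(codes) == b"abcabcabc"
```

Nine characters go out as six codes here, and the decompressor never sees
the table; it reconstructs each entry one step behind the compressor.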

So, as you can see, you only get "good" compression if the file is long
enough for the program to see repeated sequences enough times for it to
build up a code for the longer sequences.  Since the codes start out 9
bits long, as long as the codes are for single characters you don't get
any compression at all -- each 8-bit character goes out as a 9-bit code,
so the file actually turns out about one-eighth bigger than it started.
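
A rough way to check that arithmetic is to count output bits for a short
and a long input.  The sketch below is a model, not the real program: the
function name is hypothetical, and it assumes codes start at 9 bits and
widen one bit at a time (up to 16) as the table fills, with no table
reset.

```python
def lzw_output_bits(data, start_width=9, max_width=16):
    """Estimate how many bits an LZW coder would emit for a bytes object."""
    table = {bytes([i]) for i in range(256)}   # known sequences
    next_code = 256
    width = start_width
    bits = 0
    p = b""
    for byte in data:
        n = bytes([byte])
        if p + n in table:
            p += n
            continue
        bits += width                          # emit the code for p
        if next_code < (1 << max_width):
            table.add(p + n)
            next_code += 1
            # Once a code no longer fits in `width` bits, widen the codes.
            if next_code > (1 << width) and width < max_width:
                width += 1
        p = n
    if p:
        bits += width
    return bits


short = b"abc"
print(lzw_output_bits(short), "bits out for", 8 * len(short), "bits in")

long_input = b"abc" * 200
print(lzw_output_bits(long_input), "bits out for", 8 * len(long_input), "bits in")
```

The three-character input comes out as 27 bits for 24 bits in, while the
long repetitive input shrinks well below its original size.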

Since most postings (except postings like mine that ramble on and on over
some obscure topic :-)) are very short, compressing them one at a time
wouldn't work very well.  You'd have to take a set of articles, build an
index that keeps their modified headers separate, concatenate the rest of
each article into one long file, and compress that.
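
One possible shape for such a batch, as a sketch only: the index layout
and the names pack_batch and unpack_batch are assumptions, the sample
articles are invented, and zlib (DEFLATE) stands in for the LZW coder
because Python's standard library does not include one.

```python
import zlib


def pack_batch(articles):
    """articles: list of (headers, body) pairs, both bytes.

    Returns an uncompressed index of (headers, offset, length) entries
    plus one compressed blob holding all the bodies back to back.
    """
    index = []
    bodies = []
    offset = 0
    for headers, body in articles:
        index.append((headers, offset, len(body)))
        bodies.append(body)
        offset += len(body)
    return index, zlib.compress(b"".join(bodies))


def unpack_batch(index, blob):
    """Recover the (headers, body) pairs from an index and a blob."""
    data = zlib.decompress(blob)
    return [(headers, data[off:off + length])
            for headers, off, length in index]


articles = [
    (b"Subject: one\n",
     b"Data compression lowers phone bills for news feeds.\n"),
    (b"Subject: two\n",
     b"Data compression lowers phone bills for mail too.\n"),
    (b"Subject: three\n",
     b"Short postings compress badly on their own.\n"),
]

index, blob = pack_batch(articles)
separate = sum(len(zlib.compress(body)) for _, body in articles)
print("batched:", len(blob), "bytes; separately:", separate, "bytes")
```

The headers stay readable and editable in the index, and the bodies share
one compression table, so repeated phrases across articles only cost
their full length once.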
-- 
E. Roskos
"Winds allow other skylines to hold you."