Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!cbosgd!ihnp4!houxm!hjuxa!petsd!peora!jer
From: jer@peora.UUCP (J. Eric Roskos)
Newsgroups: net.mail,net.news
Subject: Re: Data compression to lower phone
Message-ID: <2197@peora.UUCP>
Date: Fri, 6-Jun-86 09:06:50 EDT
Article-I.D.: peora.2197
Posted: Fri Jun 6 09:06:50 1986
Date-Received: Sat, 7-Jun-86 14:27:39 EDT
References: <327@spdcc.UUCP> <8200002@nucsrl> <2369@phri.UUCP> <1015@k.cs.cmu.edu>
Organization: Concurrent Computer Corporation, Orlando, Fl
Lines: 38
Xref: watmath net.mail:1664 net.news:4965

> OK, so don't compress the whole article, just the body and headers that you
> don't have to look at or change.  The headers that you need can be put at
> the beginning and the compressed part can appear after a blank line.

Well, the only problem with that is that the compression scheme always works
on ordered pairs (p,n) where p is "something seen previously", and n is "the
next character seen".  The reason that the scheme works without having to
carry along with it a translation table is that p can denote an arbitrarily
long sequence of characters, but this sequence has to be built up by
experience as the program sees these pairs.

So, for example, the first time you see the sequence "abc", the compressed
output would be a code for "a", a code for "b", and a code for "c", but it
would also remember the pair ("a","b") and save a code for it in an internal
table.  So the next time it saw "abc", it would generate the code for
("a","b"), thus saving some bits over the separate codes for "a" and "b",
and then would output the code for "c", and would remember (("a","b"),"c")
in its table.  Then the next time it saw "abc" it would just output the code
for (("a","b"),"c").  So each time, the output would be a shorter number of
bits than before for the same sequence.
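
As a rough illustration of the pair-building idea described above, here is a
minimal sketch of a textbook LZW encoder in Python.  It is not the code from
compress itself: the function name lzw_compress, the choice of 256 as the
first free code, and the fixed 9-bit width used in the size estimate are all
assumptions made for the example, and the exact pairs it learns differ
slightly from the hand walk-through because it also records overlapping pairs
such as ("b","c").  It assumes 8-bit input characters.

```python
def lzw_compress(text):
    """Return the list of integer codes a simple LZW encoder emits for text."""
    # Assumption: the table starts with one code per 8-bit character (0..255).
    table = {chr(i): i for i in range(256)}
    next_code = 256      # assumption: first free code; real tools may reserve some
    prefix = ""          # p: the longest sequence seen previously that still matches
    codes = []
    for ch in text:      # n: the next character seen
        candidate = prefix + ch
        if candidate in table:
            # (p, n) is already known, so keep extending the match.
            prefix = candidate
        else:
            # Emit the code for p, then remember (p, n) for next time.
            codes.append(table[prefix])
            table[candidate] = next_code
            next_code += 1
            prefix = ch
    if prefix:
        codes.append(table[prefix])   # flush whatever is still pending
    return codes


def bits_at_fixed_width(codes, width=9):
    """Size estimate assuming every code is written with the same bit width."""
    return len(codes) * width


if __name__ == "__main__":
    for sample in ("abc", "abcabcabc"):
        codes = lzw_compress(sample)
        print(sample, codes, bits_at_fixed_width(codes), "bits vs",
              len(sample) * 8, "bits uncompressed")
```

Under those assumptions, "abc" on its own comes out as three 9-bit codes (27
bits against 24 uncompressed), while "abcabcabc" needs only six codes (54 bits
against 72), since the later repetitions reuse table entries built from the
first one.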

(The start of the code's bits in the output file is not byte aligned,
incidentally, so you don't have any "wasted" bits if you have an odd number
of bits like 9 or 10 in your code.)

So, as you can see, you only get "good" compression if the file is long
enough for the program to see repeated sequences enough times for it to
build up a code for the longer sequences.  Since the codes start out 9 bits
long, as long as the codes are for single characters you don't get any
compression at all -- the file actually turns out bigger than it started out
as.

Since most postings (except postings like mine that ramble on and on over
some obscure topic :-)) are very short, you'd either have to build an index
of a set of articles with the modified headers separate and concatenate the
rest of the set of articles into one long file and compress that, or else
the compression wouldn't work very well.
--
E. Roskos
"Winds allow other skylines to hold you."