Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!linus!philabs!prls!pyramid!chronon!eric
From: eric@chronon.UUCP (Eric Black)
Newsgroups: net.mail,net.news
Subject: Re: Data compression to lower phone
Message-ID: <293@chronon.chronon.UUCP>
Date: Tue, 10-Jun-86 18:29:27 EDT
Article-I.D.: chronon.293
Posted: Tue Jun 10 18:29:27 1986
Date-Received: Thu, 12-Jun-86 00:31:36 EDT
References: <327@spdcc.UUCP> <8200002@nucsrl> <2369@phri.UUCP> <1015@k.cs.cmu.edu> <2197@peora.UUCP>
Reply-To: eric@chronon.UUCP (Eric Black)
Organization: Chronon Computer Corp., Mtn. View, CA
Lines: 51
Xref: linus net.mail:1573 net.news:4147
Summary: Compress both headers and articles, but separately

In article <2197@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:
>> [... various suggestions to compress article bodies, not headers ...]
> [... discussion of how adaptive code compression works ...]
>
>So, as you can see, you only get "good" compression if the file is long
>enough for the program to see repeated sequences enough times for it to
>build up a code for the longer sequences.  Since the codes start out 9
>bits long, as long as the codes are for single characters you don't get
>any compression at all -- the file actually turns out bigger than it
>started out as.

It appears that the majority of news traffic consists of three basic
parts:   1) header information,  2) quoted excerpts (?) of other
articles, and 3) new text.

While the header of any one article may differ char-for-char from
every other, headers share the same field names, site names, and
newsgroup lists, so a large number of article headers taken as a whole
would seem to be an excellent candidate for this method of compression.
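
A rough way to check this is to compress a pile of headers one at a
time and then all together.  The sketch below is only an illustration:
it uses Python's zlib (DEFLATE) as a stand-in for the adaptive coder
discussed above, and the headers are made up (make_header and the
site/user names are hypothetical, loosely modelled on B news fields).

```python
import zlib

def make_header(n):
    """Build one hypothetical article header; only a few fields vary."""
    return (
        f"Path: utzoo!linus!site{n}!user{n}\n"
        f"From: user{n}@site{n}.UUCP (User {n})\n"
        "Newsgroups: net.mail,net.news\n"
        "Subject: Re: Data compression\n"
        f"Message-ID: <{100 + n}@site{n}.UUCP>\n"
        f"Date: Tue, 10-Jun-86 18:{n:02d}:00 EDT\n"
    ).encode("ascii")

headers = [make_header(n) for n in range(50)]

# Total size with no compression at all.
raw_size = sum(len(h) for h in headers)
# Each header compressed on its own: the coder starts cold every time.
separate_size = sum(len(zlib.compress(h)) for h in headers)
# All headers compressed as one batch: repeated fields are learned once.
batched_size = len(zlib.compress(b"".join(headers)))

print("raw:", raw_size, "separate:", separate_size, "batched:", batched_size)
```

The exact numbers depend on the coder and on the headers, but the batch
should come out well below the sum of the individually compressed
headers.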

Articles tend to come out in clusters in response to a given prior
article.  The quantity of quoted text is not only higher than it needs
to be in most individual articles, but the same quoted passage often
appears in a large number of followup articles.  This also gives this
compression technique good potential, assuming that articles containing
the same text are compressed together.
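
To make the point concrete, here is a simplified LZW size counter.  It
is a sketch, not the actual compress program: it starts with 9-bit
codes and widens them as the table grows, but ignores details such as
table resets, and it only counts output bits.  The quoted passage is
taken from the article above; the followup text is invented.

```python
def lzw_bits(data: bytes, max_bits: int = 16) -> int:
    """Return the output size, in bits, of a simplified LZW coder."""
    table = {bytes([i]): i for i in range(256)}  # single-byte codes
    next_code = 256
    width = 9          # codes start out 9 bits long
    bits = 0
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate      # keep extending the match
            continue
        bits += width                # emit the code for `current`
        if next_code < (1 << max_bits):
            table[candidate] = next_code
            next_code += 1
            if next_code > (1 << width) and width < max_bits:
                width += 1           # table outgrew the current code width
        current = bytes([byte])
    if current:
        bits += width                # emit the final pending code
    return bits

quote = (
    b"> So, as you can see, you only get good compression if the file is long\n"
    b"> enough for the program to see repeated sequences enough times for it to\n"
    b"> build up a code for the longer sequences.\n"
)
# Hypothetical cluster: twenty followups quoting the same passage.
followups = [quote + f"Followup {n}: a short new remark.\n".encode("ascii")
             for n in range(20)]

separate = sum(lzw_bits(f) for f in followups)
together = lzw_bits(b"".join(followups))
print("separate:", separate, "bits; together:", together, "bits")
```

Note that lzw_bits(b"abc") is 27 bits against 24 bits of input, which
is the expansion on very short files described in the quoted article.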

New text is usually some form of English, however technically oriented
it may be, and has roughly the character distribution expected of
English.  It will benefit from compression even if the exact text
appears in only one article.

One solution might be, then, to separate the bodies of news articles
from the headers, and batch and compress them separately.  This brings
up all sorts of reliability and queueing issues (since the text and the
control information are sent separately), but would let a relay site
pass article bodies through untouched, decompressing, modifying, and
recompressing only the headers.  The relay site can peruse the batched headers to
determine if it is worth decompressing the article bodies for local
consumption.  Careful partitioning of the newsgroups into batches
could then reduce the cost ($$ and cycles/disks) to relay sites, so
that they may be less reluctant to continue downstream feeding of newsgroups
they don't consume themselves.
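
A minimal sketch of the relay step, under several assumptions: zlib
stands in for the real compressor, a NUL byte serves as a hypothetical
record separator, and the function names (pack, unpack, make_batches,
relay) are invented for illustration and are not part of any news
software.

```python
import zlib

SEP = b"\x00"  # hypothetical separator; assumes articles contain no NUL bytes

def pack(parts):
    """Join text records and compress them as a single batch."""
    return zlib.compress(SEP.join(p.encode("ascii") for p in parts))

def unpack(blob):
    """Decompress a batch and split it back into text records."""
    return [p.decode("ascii") for p in zlib.decompress(blob).split(SEP)]

def make_batches(articles):
    """Split each article at its first blank line; batch headers and bodies apart."""
    headers, bodies = [], []
    for article in articles:
        header, _, body = article.partition("\n\n")
        headers.append(header)
        bodies.append(body)
    return pack(headers), pack(bodies)

def relay(header_blob, body_blob, site):
    """Prepend `site` to each Path: line; forward the body batch untouched."""
    updated = []
    for header in unpack(header_blob):
        lines = [
            f"Path: {site}!{line[len('Path: '):]}" if line.startswith("Path: ") else line
            for line in header.split("\n")
        ]
        updated.append("\n".join(lines))
    # Only the header batch is recompressed; the body bytes are passed along as-is.
    return pack(updated), body_blob

articles = [
    "Path: chronon!eric\nSubject: Re: compression\n\nBatch headers separately.\n",
    "Path: peora!jer\nSubject: Re: compression\n\nShort files compress badly.\n",
]
header_blob, body_blob = make_batches(articles)
new_header_blob, new_body_blob = relay(header_blob, body_blob, "pyramid")
print(unpack(new_header_blob)[0])
```

The relay never calls unpack on the body batch unless it wants the
articles locally, which is where the saved cycles would come from.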

This partitioning is not an easy task; it is not clear that there is anything
that could be agreed upon as an optimal solution.  Not having to
recompress passed-through articles (even if they get decompressed
locally) should save some cycles, anyway.
-- 
Eric Black   "Garbage In, Gospel Out"
UUCP:        {sun,pyramid,hplabs,amdcad}!chronon!eric