Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!linus!philabs!prls!pyramid!chronon!eric
From: eric@chronon.UUCP (Eric Black)
Newsgroups: net.mail,net.news
Subject: Re: Data compression to lower phone
Message-ID: <293@chronon.chronon.UUCP>
Date: Tue, 10-Jun-86 18:29:27 EDT
Article-I.D.: chronon.293
Posted: Tue Jun 10 18:29:27 1986
Date-Received: Thu, 12-Jun-86 00:31:36 EDT
References: <327@spdcc.UUCP> <8200002@nucsrl> <2369@phri.UUCP> <1015@k.cs.cmu.edu> <2197@peora.UUCP>
Reply-To: eric@chronon.UUCP (Eric Black)
Organization: Chronon Computer Corp., Mtn. View, CA
Lines: 51
Xref: linus net.mail:1573 net.news:4147
Summary: Compress both headers and articles, but separately

In article <2197@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:
>> [... various suggestions to compress article bodies, not headers ...]
> [... discussion of how adaptive code compression works ...]
>
>So, as you can see, you only get "good" compression if the file is long
>enough for the program to see repeated sequences enough times for it to
>build up a code for the longer sequences.  Since the codes start out 9
>bits long, as long as the codes are for single characters you don't get
>any compression at all -- the file actually turns out bigger than it
>started out as.

It appears that the majority of news traffic consists of three basic
parts: 1) header information, 2) quoted excerpts (?) of other articles,
and 3) new text.

While the particular header on a particular article may be quite
different on a char-for-char basis than any other, a large number of
article headers taken as a whole would seem to be an excellent
candidate for this method of compression.

There is a tendency for articles to come out in clusters in response to
any given prior article, and the quantity of included quotes is not
only higher than it needs to be for most individual articles, but often
the same referential text is included in a large number of followup
articles.
This also seems to provide good potential for this compression
technique, assuming that articles containing the same text are
compressed together.

New text is usually some form of English, however technically-oriented
it may be, and provides some sort of character distribution as might be
expected from English.  This will benefit from compression even if the
exact text appears in only one article.

One solution might be, then, to separate the bodies of news articles
from the headers, and batch and compress them separately.  This brings
up all sorts of reliability and queueing issues (since the text and the
control information are sent separately), but would allow pass-through
of news articles while requiring decompressing, modifying, and
recompressing the headers only.  The relay site can peruse the batched
headers to determine if it is worth decompressing the article bodies
for local consumption.

Careful partitioning of the newsgroups into batches could then reduce
the cost ($$ and cycles/disks) to relay sites, so that they may be less
reluctant to continue downstream feeding of newsgroups they don't
consume themselves.  This partitioning is not an easy task; it is not
clear that there is anything that could be agreed upon as an optimal
solution.  Not having to recompress passed-through articles (even if
they get decompressed locally) should save some cycles, anyway.

-- 
Eric Black   "Garbage In, Gospel Out"
UUCP:        {sun,pyramid,hplabs,amdcad}!chronon!eric
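
To make the batching argument concrete, here is a minimal sketch in
Python (not the code of compress or of any news software).  It counts
the bits a simplified LZW coder would emit, assuming codes start at 9
bits as described in the quoted article, and compares compressing a set
of headers one at a time against compressing them as a single batch.
The function name lzw_bits, the 16-bit code limit, and the sample
header values are all made up for illustration; a real compress file
also carries a small header and may reset its table, which is ignored
here.

```python
def lzw_bits(data: bytes, start_width: int = 9, max_width: int = 16) -> int:
    """Count the output bits of a simplified LZW coder (illustrative only)."""
    # The table starts with one code per possible byte value (0..255).
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    width = start_width
    bits = 0
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            # Keep extending the match while the table knows the sequence.
            current = candidate
            continue
        bits += width  # emit the code for the longest known sequence
        if next_code < (1 << max_width):
            table[candidate] = next_code
            next_code += 1
            # Widen the codes once the newest entry no longer fits.
            if next_code > (1 << width) and width < max_width:
                width += 1
        current = bytes([byte])
    if current:
        bits += width  # flush the final pending sequence
    return bits


# With no repeated sequences every character costs one 9-bit code:
# 8 codes x 9 bits = 72 bits, versus 64 bits of input.
print("8 distinct bytes:", lzw_bits(b"abcdefgh"), "bits out, 64 bits in")

# Hypothetical headers: same field names, different per-article values.
headers = [
    (
        f"Path: utzoo!linus!site{i}!user{i}\n"
        f"From: user{i}@site{i}.UUCP\n"
        "Newsgroups: net.mail,net.news\n"
        "Subject: Re: Data compression\n"
        f"Message-ID: <{i}@site{i}.UUCP>\n"
    ).encode("ascii")
    for i in range(20)
]

raw_bits = 8 * sum(len(h) for h in headers)
separate_bits = sum(lzw_bits(h) for h in headers)  # fresh table per header
batched_bits = lzw_bits(b"".join(headers))         # one table for the batch

print("raw:     ", raw_bits)
print("separate:", separate_bits)
print("batched: ", batched_bits)
```

On repetitive headers like these, the batched total should come out
well below the sum of the separate runs, since the table built while
reading one header is reused for the next.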