Xref: utzoo alt.sources.d:614 comp.sources.d:5560
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uwm.edu!bionet!snorkelwacker!bloom-beacon!eru!luth!sunic!dkuug!freja.diku.dk!skinfaxe.diku.dk!thorinn
From: thorinn@skinfaxe.diku.dk (Lars Henrik Mathiesen)
Newsgroups: alt.sources.d,comp.sources.d
Subject: Re: Unnecessary tar-compress-uuencodes
Message-ID: <1990Jul10.182546.26487@diku.dk>
Date: 10 Jul 90 18:25:46 GMT
References: <15652@bfmny0.BFM.COM>
Sender: news@diku.dk (The Netnews System)
Organization: Department Of Computer Science, University Of Copenhagen
Lines: 70

tneff@bfmny0.BFM.COM (Tom Neff) writes:
>[Many good reasons not to tar-compress-uuencode source and other
>plain text in news postings.]
>
> * Compressed newsfeeds, which already impart whatever transmission
>   efficiency gain LZW can offer, are circumvented and in fact
>   sandbagged by the pre-compression of data.

That turns out not to be the case. It is true that a compressed file
will usually expand if it is compressed again. But the intervening
uuencode is very important: Compressing a uuencoded file is somewhat
independent of compressing the original (*).

I made an experiment with a tar of a directory tree with mixed source,
binaries, and images.

name           size    crummy ASCII graphics
----------  -------    ---------------------
tar         4718592    tar      ------- -60.3% ------> tar.Z
                         |                               |
tar.Z       1874378    +37.8%                          +37.8%
                         |                               |
tar.uu      6501192      V                               V
                       tar.uu   ------- -60.3% ------> tar.Z.uu
tar.Z.uu    2582500      |                               |
                       -63.2%                          -13.7%
tar.uu.Z    2392701      |                               |
                         V                               V
tar.Z.uu.Z  2229065    tar.uu.Z -------  -6.8% ------> tar.Z.uu.Z

Of course, compression factors will vary widely; I have made this
experiment several times, with the same picture emerging: It pays to
compress before uuencoding, and it pays to compress after, and it pays
best to do both.

In words: If you have to post uuencoded stuff (tar archives, images,
whatever), COMPRESS them first.
It is always better: In terms of storage on intermediate nodes and of
transmission on non-compressed links it is very much better; it may
not save much on compressed links, but it doesn't hurt (contrary to
common assertions), and the small saving may still pay for the cost to
run compress (and compress has less data to process, anyway, so it
doesn't run for so long).

I wish this misconception about the badness of compressed uuencoded
data on compressed news links would go away; anyone for a news.config
FAQ posting?
______________________________________________________________________
(*) An attempt at an explanation: The uuencode process maps the source
bytes into a smaller set (64 symbols), and it maps three source bytes
into four and puts in newlines. Compress works by finding common byte
sequences and mapping them into symbols. A common source sequence will
occur in three different ``phases'' after uuencode, and may be broken
by newlines, so compress will not find it as easily. Of course, long
sequences of identical bytes, as often in images, are immune to the
shift effect.

On the other hand, a 16-bit compress should be able to map all the
2-symbol uuencode sequences and about one fourth of the 3-symbol ones
into a 16-bit symbol, giving a compression of about 12% on the
uuencode of a totally random byte sequence. (Running compress after
compress-uuencode usually gives between 11% and 14% compression,
bearing this out; for this purpose, the first compress effectively
gives a random sequence.)

So: compress may get more of the ``available compression'' in a given
input if it is run before uuencode. On the other hand, compress will
be able to undo some of the expansion caused by uuencode, masking the
first effect.

--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.      thorinn@diku.dk
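
The percentages in the table and the footnote's estimate can be
checked with a short calculation. What follows is a minimal Python
sketch, not part of the original experiment. The byte counts are
copied from the table; every name in the code is made up for
illustration. Two things are assumptions: that a full uuencode line
carries 45 source bytes as one length character, 60 data characters
and a newline, and the crude model of a 16-bit compress on random
uuencode text (every code stands for either 2 or 3 symbols), which is
only one reading of the footnote's argument.

```python
# Sizes in bytes, copied from the table in the post.
SIZES = {
    "tar": 4718592,
    "tar.Z": 1874378,
    "tar.uu": 6501192,
    "tar.Z.uu": 2582500,
    "tar.uu.Z": 2392701,
    "tar.Z.uu.Z": 2229065,
}


def change(src, dst):
    """Percentage size change when going from file src to file dst."""
    return 100.0 * (SIZES[dst] - SIZES[src]) / SIZES[src]


# Assumption: a full uuencode line turns 45 source bytes into
# 1 length character + 60 data characters + 1 newline = 62 bytes.
UU_EXPANSION = 100.0 * (62 - 45) / 45

# Crude model of the footnote's estimate (an assumption, not the
# author's own calculation): 64 possible symbols, 2**16 codes, all
# 2-symbol sequences get a code, the remaining codes go to 3-symbol
# sequences, and every code is emitted as 16 bits.
SYMBOLS = 64
CODES = 2 ** 16
PAIRS = SYMBOLS ** 2                          # 4096 two-symbol sequences
TRIPLES = SYMBOLS ** 3                        # 262144 three-symbol sequences
TRIPLE_FRACTION = (CODES - PAIRS) / TRIPLES   # share of triples that fit
AVG_SYMBOLS = 3 * TRIPLE_FRACTION + 2 * (1 - TRIPLE_FRACTION)
MODEL_SAVING = 100.0 * (1 - 16 / (8 * AVG_SYMBOLS))

# Print every arrow of the diagram with its recomputed percentage.
for src, dst in [
    ("tar", "tar.Z"),
    ("tar", "tar.uu"),
    ("tar.Z", "tar.Z.uu"),
    ("tar.uu", "tar.uu.Z"),
    ("tar.Z.uu", "tar.Z.uu.Z"),
]:
    print(f"{src:>9} -> {dst:<10} {change(src, dst):+6.1f}%")

print(f"uuencode expansion per full line: {UU_EXPANSION:+.1f}%")
print(f"share of 3-symbol sequences with a code: {TRIPLE_FRACTION:.3f}")
print(f"model saving on random uuencode text: {MODEL_SAVING:.1f}%")
```

The recomputed arrows agree with the diagram, and 62/45 accounts for
the +37.8% on both uuencode steps. The crude model comes out a little
over 10%, somewhat below the footnote's "about 12%" and the measured
11% to 14%; it ignores the length characters and newlines, which are
highly predictable, so it should be read as a rough lower figure only.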