Xref: utzoo comp.binaries.ibm.pc.d:1350 comp.sys.ibm.pc:21279
Path: utzoo!attcan!uunet!husc6!mailrus!cwjcc!neoucom!wtm
From: wtm@neoucom.UUCP (Bill Mayhew)
Newsgroups: comp.binaries.ibm.pc.d,comp.sys.ibm.pc
Subject: Re: WARNING! Vicious bug in GSARC
Summary: Huffman coding can produce bigger files than source
Keywords: GSARC, Archivers, Compression, Bugs
Message-ID: <1413@neoucom.UUCP>
Date: 15 Nov 88 22:10:23 GMT
References: <6627@watcgl.waterloo.edu> <42285@yale-celray.yale.UUCP> <6634@watcgl.waterloo.edu>
Organization: Northeastern Ohio Universities College of Medicine
Lines: 19


I don't know what GSARC uses for its coding scheme.  If it uses
huffman coding to compress ASCII files, the output can be bigger
than the input on a short file.  Huffman coding uses a variable
number of bits to encode characters based upon the frequency with
which characters appear.  Characters that appear frequently are
encoded with short bit patters, which infrequently encountered
characters are longer.  A table is prepended to such a file so that
the decoding algorithm knows which is which.

On a short file where all characters appear with about the same
freuquency, huffman coding is inefficient.  You are also penalized
by the fact that the lookup table takes some space.  Arcing a
uuencoded file of a few K in length would possibly present such a
situation.

Unix compress, for instance, uses huffman coding.

--Bill