Xref: utzoo comp.binaries.ibm.pc.d:1350 comp.sys.ibm.pc:21279 Path: utzoo!attcan!uunet!husc6!mailrus!cwjcc!neoucom!wtm From: wtm@neoucom.UUCP (Bill Mayhew) Newsgroups: comp.binaries.ibm.pc.d,comp.sys.ibm.pc Subject: Re: WARNING! Vicious bug in GSARC Summary: Huffman coding can produce bigger files than source Keywords: GSARC, Archivers, Compression, Bugs Message-ID: <1413@neoucom.UUCP> Date: 15 Nov 88 22:10:23 GMT References: <6627@watcgl.waterloo.edu> <42285@yale-celray.yale.UUCP> <6634@watcgl.waterloo.edu> Organization: Northeastern Ohio Universities College of Medicine Lines: 19 I don't know what GSARC uses for its coding scheme. If it uses huffman coding to compress ASCII files, the output can be bigger than the input on a short file. Huffman coding uses a variable number of bits to encode characters based upon the frequency with which characters appear. Characters that appear frequently are encoded with short bit patters, which infrequently encountered characters are longer. A table is prepended to such a file so that the decoding algorithm knows which is which. On a short file where all characters appear with about the same freuquency, huffman coding is inefficient. You are also penalized by the fact that the lookup table takes some space. Arcing a uuencoded file of a few K in length would possibly present such a situation. Unix compress, for instance, uses huffman coding. --Bill