Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!samsung!munnari.oz.au!yoyo.aarnet.edu.au!sirius.ucs.adelaide.edu.au!spam!ross From: ross@spam.ua.oz.au (Ross Williams) Newsgroups: comp.compression Subject: How to code to a text file. Summary: How to code to a text file. Keywords: text file arithmetic coding hard coded wired range printable characters Message-ID: <550@spam.ua.oz> Date: 23 Mar 91 02:53:27 GMT Sender: ross@spam.ua.oz Followup-To: comp.compression Organization: Statistics, Pure & Applied Mathematics, University of Adelaide Lines: 82 Title: PAINLESS HARD-WIRED ARITHMETIC CODING. Subtitle: How to Code White Noise into Text. Author: Ross Williams. EMail: ross@spam.ua.oz.au Date: 23-Mar-1991. (Abstract: This short article describes a simple way to code a stream of random bits into a stream of text characters.) Text files are a very desirable way to represent information. Although they waste a little space, they seem to be the only things that can be transferred between computers without hassles. For this reason, I have considered the implications of a using text files as a medium for storing compressed data. I assume that only printable bytes [32,126] will be included in the text file, except for the end of line byte (10), which will appear at regular intervals to keep the line length down. Some programs (such as Kermit) have difficulty with long lines (e.g. >2K long). It is also prudent to exclude blanks from the range as some overenthusiastic programs may convert runs of blanks into TAB characters or delete leading or trailing blanks. In the end, we are left with a stream of bytes in the range [33,126]. This is 94 values, a particularly inconvenient range as it is not a power of two. In fact, it is not even close to a power of two, being almost exactly between 64 and 128. It misses out by 2 on being exactly half way and has lots of bits turned on (94=64+16+8+4+2=\p{1011110}). This makes it a very inconvenient range to code a stream of random bits (such as might be generated by a fast LZ technique). As might be expected, all the low powers of 94 are unpleasant numbers too. All in all, a very unpleasant range to code into. My first reaction was to say "No problems, lets slam in an arithmetic coder and everything will be OK" which indeed is what I decided on. However, your typical arithmetic coder ranges in implementation complexity somewhere between debugging AI programs in assembler and simulating a connection machine on a TRS-80. And they can be slow too. However, by hard wiring the arithmetic coding, the implementation can be much less painful than for a full arithmetic coder, which is why I'm bothering to tell you all this. To hardwire the arithmetic coding, form a mapping from the range [0,63] (6 bits exactly) onto the range [0,93] such that 34 of the 64 values map onto one value of the range [0,93] and 30 of the 64 values map onto two values. It doesn't matter exactly what this mapping is. Probably the simplest mapping is [0,33]-->[0,33] and [34,63]-->[34,93] (with 34-->(34,35), 35-->(36,37) etc). If desired, this mapping could be implemented using a lookup table. Now a completely random bit stream arising from a standard compression algorithm (e.g. unix compress or some other LZ algorithm), can be coded as follows. Take the next six bits of the data and map it. If the six-bit number maps to a single number, that number is transmitted. If the six bit number maps to a pair of numbers, the next input bit is used to determine which of the pair will be transmitted. The procedure is reversed at the receiving end. The efficiency of this scheme can be calculated as follows. The amount of information in a 94 symbol range is log_2(94)=6.555 bits. If the input bit stream is truly random (as it should be from a perfect compression algorithm), 6 bits will be transferred per text character 34/64 of the time, and 7 bits will be transferred per text character 30/64 of the time. This makes (34/64)*6+(30/64)*7=6.469 bits. This is an efficiency of 6.469/6.555=0.987 which is quite good. This scheme will be reasonably fast to implement, the main problem being all the bit manipulation. If you want it to go fast you can hard-wire the eight bit-alignment configurations into your program so as to deal with each alignment situation optimally. I have not actually implemented this scheme, but am likely to sometime in the next few months as part of a bigger project. I hope someone out there finds this simple scheme useful, Keep your entropy low, Ross Williams ross@spam.ua.oz.au