Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!samsung!munnari.oz.au!yoyo.aarnet.edu.au!sirius.ucs.adelaide.edu.au!spam!ross
From: ross@spam.ua.oz.au (Ross Williams)
Newsgroups: comp.compression
Subject: How to code to a text file.
Summary: How to code to a text file.
Keywords: text file arithmetic coding hard coded wired range printable characters
Message-ID: <550@spam.ua.oz>
Date: 23 Mar 91 02:53:27 GMT
Sender: ross@spam.ua.oz
Followup-To: comp.compression
Organization: Statistics, Pure & Applied Mathematics, University of Adelaide
Lines: 82


Title: PAINLESS HARD-WIRED ARITHMETIC CODING.
Subtitle: How to Code White Noise into Text.
Author: Ross Williams.
EMail: ross@spam.ua.oz.au
Date: 23-Mar-1991.

(Abstract: This short article describes a simple way to code a stream
of random bits into a stream of text characters.)


Text files are a very desirable way to represent information. Although
they waste a little space, they seem to be the only things that can be
transferred between computers without hassles. For this reason, I have
considered the implications of a using text files as a medium for
storing compressed data.

I assume that only printable bytes [32,126] will be included in the
text file, except for the end of line byte (10), which will appear at
regular intervals to keep the line length down. Some programs (such as
Kermit) have difficulty with long lines (e.g. >2K long). It is also
prudent to exclude blanks from the range as some overenthusiastic
programs may convert runs of blanks into TAB characters or delete
leading or trailing blanks.

In the end, we are left with a stream of bytes in the range [33,126].
This is 94 values, a particularly inconvenient range as it is not a
power of two. In fact, it is not even close to a power of two, being
almost exactly between 64 and 128. It misses out by 2 on being exactly
half way and has lots of bits turned on (94=64+16+8+4+2=\p{1011110}).
This makes it a very inconvenient range to code a stream of random
bits (such as might be generated by a fast LZ technique). As might be
expected, all the low powers of 94 are unpleasant numbers too. All in
all, a very unpleasant range to code into.

My first reaction was to say "No problems, lets slam in an arithmetic
coder and everything will be OK" which indeed is what I decided on.
However, your typical arithmetic coder ranges in implementation
complexity somewhere between debugging AI programs in assembler and
simulating a connection machine on a TRS-80. And they can be slow too.
However, by hard wiring the arithmetic coding, the implementation can
be much less painful than for a full arithmetic coder, which is why
I'm bothering to tell you all this.

To hardwire the arithmetic coding, form a mapping from the range
[0,63] (6 bits exactly) onto the range [0,93] such that 34 of the 64
values map onto one value of the range [0,93] and 30 of the 64 values
map onto two values. It doesn't matter exactly what this mapping is.
Probably the simplest mapping is [0,33]-->[0,33] and [34,63]-->[34,93]
(with 34-->(34,35), 35-->(36,37) etc). If desired, this mapping could
be implemented using a lookup table.

Now a completely random bit stream arising from a standard compression
algorithm (e.g. unix compress or some other LZ algorithm), can be
coded as follows. Take the next six bits of the data and map it. If
the six-bit number maps to a single number, that number is
transmitted. If the six bit number maps to a pair of numbers, the next
input bit is used to determine which of the pair will be transmitted.
The procedure is reversed at the receiving end.

The efficiency of this scheme can be calculated as follows. The amount
of information in a 94 symbol range is log_2(94)=6.555 bits. If the
input bit stream is truly random (as it should be from a perfect
compression algorithm), 6 bits will be transferred per text character
34/64 of the time, and 7 bits will be transferred per text character
30/64 of the time. This makes (34/64)*6+(30/64)*7=6.469 bits. This is
an efficiency of 6.469/6.555=0.987 which is quite good.

This scheme will be reasonably fast to implement, the main problem
being all the bit manipulation. If you want it to go fast you can
hard-wire the eight bit-alignment configurations into your program so
as to deal with each alignment situation optimally.

I have not actually implemented this scheme, but am likely to sometime
in the next few months as part of a bigger project.

I hope someone out there finds this simple scheme useful,

Keep your entropy low,

Ross Williams
ross@spam.ua.oz.au