Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!samsung!munnari.oz.au!yoyo.aarnet.edu.au!sirius.ucs.adelaide.edu.au!spam!ross
From: ross@spam.ua.oz.au (Ross Williams)
Newsgroups: comp.compression
Subject: Justification of proposed interface standard.
Summary: Justification of proposed interface standard.
Keywords: data compression interface standard justification
Message-ID: <860@spam.ua.oz>
Date: 20 Jun 91 15:49:15 GMT
Sender: ross@spam.ua.oz
Followup-To: comp.compression
Organization: Statistics, Pure & Applied Mathematics, University of Adelaide
Lines: 397

NOTES ON THE DESIGN OF THE PROPOSED DATA COMPRESSION INTERFACE STANDARD
-----------------------------------------------------------------------
Author : Ross Williams.
Date   : 20-Jun-1991.

CONTENTS
   1. Introduction
   2. Typical Applications
   3. Control Structure
   4. Input Structure
   5. Flushing
   6. Buffering
   7. Definition of an Algorithm
   8. Working Memory
   9. Data Format
  10. Generalizing the Interface
  11. Final Note

1. Introduction
---------------
This document attempts to justify a selection of the design decisions
that determined the proposed data compression interface standard.

2. Typical Applications
-----------------------
The following typical applications of data compression serve to ground
the standard in the practical. The standard was framed with these
applications in mind.

UNIX FILTER COMPRESSOR: A Unix data compression program reads a stream
of bytes from standard input with no foreknowledge of when the stream
is going to end, and writes a compressed representation of the stream
of bytes to standard output. The compression algorithm is implemented
as a process which has complete control over its control flow and IO.

MEMORY BLOCK COMPRESSOR: A word processing program divides the text
being edited into blocks and stores each block in compressed form,
decompressing it on demand. The data compression algorithm is fed a
block of bytes and is expected to produce a compressed representation
of that block.

MODEM COMPRESSOR: An error correcting modem employs a data compression
algorithm implemented as an abstract data type. Sometimes, when
screens of data are to be transmitted, a block of 2K is presented for
compression. At other times, when a user keystroke has to be
transmitted without delay, a single byte is presented for compression.
In both cases, the resulting representation must be coded in the
context of the previous bytes compressed. The modem instructs the
algorithm to forget its context between connect sessions.

BUFFERED FILE COMPRESSOR: A file compression program compresses a set
of files independently. It divides each file into blocks of 50K and
feeds the blocks in succession to a data compression algorithm
implemented as an abstract data type. The algorithm must compress each
block in the context of previous blocks transmitted.

3. Control Structure
--------------------
The applications described above differ widely in the degree of
control and data retention capability given to the data compression
algorithm. Control ranges from tightest to loosest as follows:

   * Procedural          Example: Block.
   * Abstract data type  Example: Modem, File.
   * Process             Example: Unix.

In defining a standard, a particular level of control had to be
decided upon. This was a fairly easy decision. Procedural
organizations do not allow a data compression algorithm to retain
information between invocations, which makes them unsuitable for
context stream applications. Process organizations are very general
but are messy and awkward to implement in languages with only one
thread of control (most languages). In choosing a control structure,
the abstract data type structure is the clear winner. An ADT structure
allows context to be preserved but does not wrench control from the
user program.
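To make the contrast concrete, the following fragment sketches what an
ADT organization might look like in C. The names and types here are
hypothetical illustrations, not part of the proposed standard; the
point is only that the algorithm's retained state lives in a record
supplied by the user program, so context survives between invocations
without the algorithm seizing the thread of control.

   #include <stddef.h>

   /* Hypothetical ADT organization: the user program owns both the    */
   /* control flow and the state record; the algorithm merely          */
   /* transforms the bytes it is handed on each call, updating the     */
   /* model held in the state record as it goes.                       */
   typedef struct {
      unsigned char *state;  /* Algorithm's private model/working data.*/
      size_t         size;   /* Size of the state record in bytes.     */
   } cmp_adt;

   /* Compress one block in the context of all blocks seen so far.     */
   void cmp_block(cmp_adt *adt,
                  const unsigned char *in,  size_t in_len,
                  unsigned char       *out, size_t *out_len);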
4. Input Structure
------------------
The decision made in section 3 implies that the user program will
create blocks of data and invoke a compression procedure to compress
the data. The compression procedure will be passed a block of memory
which it may use to create a model of the data and to pass information
(such as the model) between invocations.

The next major design decision involves the imposition of structure on
the sequence of data blocks.

   * CONTEXT  : When should the algorithm ignore its model and start afresh?
   * FLUSHING : When should the algorithm complete a decompressible block?
   * CHUNKS   : What should be the size of data blocks fed to the algorithm?

By "ignore its model" is meant cease to use any knowledge about the
data already seen to compress the data about to be seen. By "flushing"
is meant writing zero or more output bytes to complete the
representation of the bytes already received for compression.

In the general case, the processing of a stream of blocks of input
data can be modelled as follows. The input data consists of

   Zero or more CONTEXT blocks consisting of
      One or more CLOSED blocks consisting of
         One or more PIECE blocks consisting of
            Zero or more bytes.

              Context Blocks
 +----------------------------------------+  +------------------------+
 |              Closed Blocks             |  |      (similar over     |
 | +----------------+  +-----------+ +-+  |  |       here too)        |
 | |  Piece Blocks  |  |           | | |  |  |                        |
 | | +--+ +--+ +--+ |  | +--+ +--+ | | |  |  |                        |
 | | |  | |  | |  | |  | |  | |  | | | |  |  |                        |
 | | +--+ +--+ +--+ |  | +--+ +--+ | | |  |  |                        |
 | +----------------+  +-----------+ +-+  |  |                        |
 +----------------------------------------+  +------------------------+
 ----> ^    ^    ^                 ^      ^  <--- "Calls" to data compressor.
 Input &                           ^Flush coder buffer
 Time                                     ^Destroy model

Piece blocks consist of zero or more bytes. The algorithm reads each
piece block and writes zero or more bytes to the output in response.

At the end of each context block, the compressor destroys its model of
the data and approaches the next context block with a clean slate.

At the end of each closed block, the compressor flushes its coder
buffers. The output produced during the processing of a closed block
is a complete representation of the closed block.

At the end of each piece block the compressor writes zero or more
bytes to the output but will not necessarily flush a complete
representation of what it has read.

At this stage, we have an interface that looks roughly like this:

   IN    action   - The action to be performed (identity,compress,decompress).
   INOUT memory   - The compressor algorithm's ADT data (=activation record).
   IN    inblock  - The input block.
   OUT   outblock - The output block.
   IN    context? - Use the context from the previous block?
   IN    flush?   - Flush the coder?

with the boolean flags (marked by "?") indicating when the algorithm
should discard its context and flush its coder.
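As a concreteness check, here is one way the rough interface above
might be rendered in C. This is a hypothetical sketch only, not a
binding part of the standard (the "?" flags become ints, since C has
no boolean type, and explicit lengths accompany the blocks):

   #include <stddef.h>

   typedef enum {CMP_IDENTITY, CMP_COMPRESS, CMP_DECOMPRESS} cmp_action;

   /* Hypothetical C rendering of the rough interface of section 4.    */
   /* The last two arguments correspond to "context?" and "flush?".    */
   void cmp(cmp_action action,              /* IN    action            */
            void *memory,                   /* INOUT ADT data          */
            const unsigned char *inblock,   /* IN    inblock           */
            size_t inblock_len,
            unsigned char *outblock,        /* OUT   outblock          */
            size_t *outblock_len,
            int use_context,                /* IN    context?          */
            int flush);                     /* IN    flush?            */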
5. Flushing
-----------
The input structure described in section 4 is general and will satisfy
most users. However, it is also a bit messy. It is inevitable that
programmers will be confused by the context and flush flags and that
subtle bugs will be created. It would be desirable to simplify the
interface somehow.

The following table shows how the four applications described in
section 2 make use of the input structure described in section 4.

         +-----------------------------------------------------+
         | Context Blocks  Closed Blocks  Piece Blocks         |
+--------+-----------------------------------------------------+
| UNIX   | One             One            One byte long        |
| MEMORY | One             One            One per closed block |
| MODEM  | Many            Many           One per closed block |
| FILE   | Many            Many           One per closed block |
+--------+-----------------------------------------------------+

This table indicates that the input structure of piece blocks (and
hence the flush flag) may not be necessary. The only application that
uses piece blocks is the Unix compressor, which could quite easily
circumvent the problem using buffering. In general, the absence of an
optional-flush facility becomes awkward only when the piece block size
is very small (as in the case of the Unix compressor). If the cost of
flushing is (say) two bytes, forced flushing on piece blocks of 1024
bytes carries negligible cost (about 0.2%) whereas forced flushing of
piece blocks of 1 byte carries an enormous cost (about 200%).

Additional reasons for the decision not to use a flush flag are given
below:

1. The unit of compression is the closed block. Because the standard
   cannot guarantee that even the first byte of an uncompressed closed
   block can be made available until the last byte of the compressed
   representation of the closed block has been received, the piece
   block level is seen to be a buffering mechanism which is better
   performed externally.

2. The user program is better able to assess the availability of
   memory for buffering.

SUMMARY: Although a difficult decision, my vote goes to the
elimination of the flush flag and the compulsory imposition of
flushing at the end of each data block presented to the compression
algorithm.

6. Buffering
------------
Various arguments can be raised for providing an interface that forces
the compression algorithm to perform various levels of buffering.
Currently the proposed standard allows the user to specify the size of
input blocks but not the size of output blocks. Some might desire a
standard in which the size of the input and output blocks can be
specified by the user, with the algorithm buffering a stream in the
middle.

The best argument against buffering is that it imposes an artificial
layer and hides the closed block structure. In the end, it must be
remembered that data compression algorithms should compress data, and
that this means receiving some data and making it smaller. Any attempt
to hide the fact that the data has become smaller (e.g. by buffering)
is a cosmetic addition that (while useful) should be located outside
the standard. Perhaps there could be a second-order standard for a
shell procedure that provides a soft buffered interface, such as the
one sketched below.
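The following fragment sketches what such a shell might look like in
C. It is a hypothetical illustration layered on the hypothetical
cmp_adt and cmp_block names from the earlier sketches, and it lives
outside the standard proper: the shell accumulates the caller's bytes
and presents the algorithm with a full closed block at a time.

   #include <stddef.h>
   #include <string.h>

   #define SHELL_BLOCK 32768  /* Closed-block size chosen by the shell. */

   typedef struct {
      unsigned char buf[SHELL_BLOCK];  /* Bytes awaiting compression.   */
      size_t        used;              /* Number of bytes in buf.       */
   } shell;

   /* Accept n bytes from the caller; whenever a full closed block has  */
   /* accumulated, compress it and append the result to out. The caller */
   /* must supply an out buffer large enough for the worst case.        */
   void shell_write(shell *sh, cmp_adt *adt,
                    const unsigned char *p, size_t n,
                    unsigned char *out, size_t *out_len)
   {
      *out_len = 0;
      while (n > 0) {
         size_t room = SHELL_BLOCK - sh->used;
         size_t take = (n < room) ? n : room;
         memcpy(sh->buf + sh->used, p, take);
         sh->used += take;
         p        += take;
         n        -= take;
         if (sh->used == SHELL_BLOCK) {   /* A closed block is ready.   */
            size_t produced;
            cmp_block(adt, sh->buf, SHELL_BLOCK, out + *out_len, &produced);
            *out_len += produced;
            sh->used = 0;
         }
      }
   }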
7. Definition of an Algorithm
-----------------------------
The definition of a data compression algorithm provided an interesting
challenge in the framing of the standard.

Authors of data compression algorithms typically publish a paper
describing the algorithm in generic terms. The "algorithm", which is
often given a generic-sounding name, is usually specified as a
parameterized object having parameters that affect the tradeoff
between the following performance characteristics:

   COMPRESSION     : Compression yielded by the algorithm.
   MEMORY          : Memory required by the algorithm.
   SPEED           : Speed of the algorithm.
   CODE COMPLEXITY : Degree of difficulty to implement.
   ADAPTIVITY      : Speed at which the algorithm adapts to new data.
   GENERALITY      : Degree to which algorithm is tailored for specific data.

The presence of parameters, and the often vague descriptions of the
algorithms themselves, do not help in the comparison of algorithms.
Those comparing algorithms are often forced to pick parameters in
ignorance of the true impact of a particular parameter on the
performance measures listed above. If the standard were to provide
parameters in the interface which the user program could use to dial
up various versions of particular algorithms, the parameters provided
in the interface would vary in their semantic meaning from algorithm
to algorithm.

To resolve this mess, the standard prohibits parameters. This means
that generic algorithms that have parameters must be instantiated with
specific static parameter values before they can be standardized. The
standard also forces the algorithm designer to specify how much memory
the algorithm will require.

The benefits here are enormous. By fixing algorithms in granite, the
standard provides fixed points that investigators can refer to when
comparing algorithms or classes of algorithm. By requiring that
algorithm designers choose a good set of parameters before being
allowed to standardize an algorithm, the standard places the
responsibility for parameter choice in the hands of those who are in
the best position to tune the algorithm.

8. Working Memory
-----------------
To an extent, the arguments given above for the elimination of
algorithm parameters from the interface also apply to the parameter of
working memory. However, memory is special because it is a parameter
whose semantics are independent of any particular data compression
algorithm. It seems quite sensible to create a generic memory
parameter in the interface, whereas the creation of "parameter1"
(changes in which would have different effects for different
algorithms) seems less justified.

One problem with allowing the user to specify the memory that will be
given to an algorithm is that it does not allow algorithms to specify
a minimum memory requirement. Some algorithms need a certain minimum
amount of memory (the exact amount being particular to the algorithm).
For such cases, an alternative standard would have to either provide a
method for the algorithm to convey the requirement to the user (as the
proposed standard does) or provide some means for an algorithm to
signal failure (which increases the messiness of the interface).

Another disadvantage of user-controlled memory procurement is the
problems that it causes with the decompressor. If the compressor is
handed a quantity of memory that is not fixed for a given algorithm,
then the decompressor has to be informed of the exact amount as well
(the compressor can do this in the first few bytes of the coded
message). More importantly, under this scheme, the decompressor does
not learn how much memory it will require of the user until it is
already executing and has read the first few bytes of the compressed
message.

After a lot of thought I decided that the best solution was to force
each algorithm to advertise exactly how much memory it requires for
compression and decompression. It is then the user's job to give the
algorithm exactly the amount of memory requested. While this
organization limits the capacity of an individual algorithm (as
standardized) to exploit whatever memory could be made available to
it, it does allow the user program to choose one of many algorithms
depending on the memory that it knows will be available at the
compression and decompression ends.
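Under this rule, the memory figures become part of an algorithm's
identity. A hypothetical sketch of the idea in C follows; the names
and figures are illustrative only and are not advertised by any
standardized algorithm:

   #include <stdlib.h>

   /* A standardized algorithm advertises fixed memory requirements.   */
   #define LZRW1_16K_MEM_COMPRESS   16384L  /* Illustrative figure.    */
   #define LZRW1_16K_MEM_DECOMPRESS 16384L  /* Illustrative figure.    */

   /* The user program procures exactly the advertised amount. Any     */
   /* failure is handled here, outside the algorithm, before the       */
   /* algorithm is ever invoked.                                       */
   void *procure_compression_memory(void)
   {
      void *wrk = malloc((size_t) LZRW1_16K_MEM_COMPRESS);
      return wrk;  /* NULL means: choose a leaner algorithm instead.   */
   }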
Furthermore, the designers of a particular algorithm can release
versions of the algorithm under related names that vary only in their
memory requirements. E.g.:

   lzrw1-1K
   lzrw1-8K
   lzrw1-16K
   lzrw1-32K

The standard specifies that an algorithm be given no more memory than
it requested. This eliminates the complication of conveying to the
decompressor that more memory was used. Algorithm designers who want
to do this can split their algorithms as described above. An added
advantage of this scheme is that it is likely to lead to an informal
standardization on a certain set of memory sizes, and this will make
it even easier to compare data compression algorithms.

In any design, there are always many arguments for including messy and
complicated features. However, in most cases, the pain of eliminating
a feature is more than compensated for by the benefits conferred by
the resulting simplicity. I believe that this issue of memory is such
a case.

On a related point, the standard specifies that the algorithm be given
memory rather than seek it out from its environment (e.g. by
allocating stack space or through a malloc), as the capacity of
environments to provide memory varies from system to system. Examples:

   1) The VAX architecture does not permit stack frames greater
      than 64K.
   2) Heaps on the Macintosh sometimes become fragmented, making it
      impossible to allocate large contiguous chunks of memory.

Furthermore, if it were possible for the data compression algorithm's
request for memory to fail, then that failure condition would have to
be dealt with somehow, further muddying the interface. By forcing the
user program to front up with the memory before invoking the data
compression algorithm, all error handling is performed outside the
data compression algorithm. Furthermore, under the standard as
proposed, if there is not enough memory to run a particular algorithm,
the user program can choose another algorithm that requires less
memory (assuming that a selection of algorithms has been coded into
the program - quite possible), whereas this is not an option available
to an individual data compression algorithm.

9. Data Format
--------------
The standard purposely steers away from defining any sort of
compressed data format. Nevertheless, in any real system employing the
standard, the user is forced to record the following information
somewhere (either statically or dynamically):

   1) The id of the algorithm used to compress a particular set of data.
   2) The length of each compressed block of data.

In addition, the user may also wish to store other attributes of the
compression operation.

As an example, the standard COULD specify that each algorithm place
the length of each compressed block in the first four bytes of each
compressed block. However, the standard sets no such conditions on the
data format. The reason for this is purity of standardization. The
goal of this standard is to define the interface between the user
program and conforming data compression algorithms. To impose any
format on the data would be to extend the standard "beyond its terms
of reference".
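For illustration only (the standard itself imposes no such format), a
user program might record this framing information in a small header
of its own devising, e.g. in C:

   /* Hypothetical per-block framing record, maintained by the USER    */
   /* program, not by any conforming algorithm.                        */
   typedef struct {
      unsigned long algorithm_id;  /* Id of the algorithm used.        */
      unsigned long length;        /* Length of the compressed block.  */
      /* ...plus any other attributes the user wishes to record...     */
   } block_header;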
10. Generalizing the Interface
------------------------------
The interface that the proposed standard gives for data compression
algorithms is likely to be similar for other general data transforming
operations such as cryptographic and error correcting transformations,
and it may be asked why a more general standard has not been developed
for generalized information transformers.

The reason I did not do this is that, when I investigated the idea, I
found enough differences between the classes of transformation listed
to put me off. In addition, the issues that arose in the development
of the proposed standard data compression algorithm interface
convinced me that it would be best first to define standard interfaces
for a variety of classes of transformers (with an eye to later
combination) and then look for similarities that will allow them to be
combined. While solving the more general problem may sometimes be
easier than solving the specific one (Polya), an interface designed
for generalized transformers may miss important opportunities that
arise in the interface for one particular class of transforming
algorithm.

11. Final Note
--------------
This document gives particular reasons for particular design
decisions. In many cases the reasons are an oversimplification of a
cluster of factors pointing towards a particular decision.

The author acknowledges that this document is sketchy and incomplete,
but at this point would rather spend his time defending the standard
against particular criticisms from particular people than polish this
document. Further refinement of this document will take place once the
standard has been aired.

----