Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Posting-Version: version B 2.10.1 6/24/83; site ames.UUCP Path: utzoo!watmath!clyde!burl!ulysses!mhuxl!houxm!houxz!vax135!cornell!uw-beaver!tektronix!hplabs!ames!jaw From: jaw@ames.UUCP (James A. Woods) Newsgroups: net.unix-wizards,net.research Subject: Lempel-Ziv Compression Made Swift (method of J. A. Woods), part 1 Message-ID: <425@ames.UUCP> Date: Thu, 26-Jul-84 00:39:02 EDT Article-I.D.: ames.425 Posted: Thu Jul 26 00:39:02 1984 Date-Received: Sat, 21-Jul-84 03:39:50 EDT Organization: NASA-Ames Research Center, Mtn. View, CA Lines: 109 # "It's fresh squozen daily." -- official slogan of the Ames fast file finder (yet another data compression technique) Herein we report on a simple and fleet implementation of the LZW algorithm now making the rounds on UNIX. For equivalent compression rates (2:1 for unix-wizards, 3:1 and more for numerical data), it runs at least 1.6 times faster for large files than the code offered by Spencer Thomas. This is a C language comparison only; a greater speed differential is expected if both versions were written in machine language). This method may be impractical for output codes of length greater than 13 bits, though, fortunately, the basic algorithm offers only diminishing returns beyond this. Like Mr. Thomas, I adapted the Welch implementation to a different data structure, believing that the hashing method reported in the June IEEE Computer to be "too slow". Essentially, the code below is a direct-access lookup of the alternate string table pictured in the article. There are 2 ** BITS codes which serve as prefixes to be associated with any of 256 ascii characters. For the typical BITS == 12, this looks unwieldy, but is actually workable on a 32-bit processor with the enhancements I describe. A first pass took the form: short table [(1 << BITS) * 256]; /* 2 megabytes for BITS = 12 */ squash ( ) { register int K, i, p, prefix; register long code; bzero ( table, 1 << BITS ); /* clear table, but see below */ for ( i = 0; i < 256; i++ ) table [i] = i; prefix = getchar ( ); while ( (K = getchar ( )) != EOF ) { code = (long) ((K << BITS) + prefix); /* test for code in "string" table */ if ( (p = table [code]) != 0 ) prefix = p; else { /* put code in table */ output ( prefix ); prefix = K; if ( i < (1 << BITS) ) /* ran out of codes */ table [code] = (short) i++; } } output ( prefix ); /* last one */ } The table initialization time (Is bcopy() the fastest way to do this in C?) is disconcerting, some 6 seconds of system time (2 MB) on our 11/750. This is unacceptable for small files. It turns out that, using a variant of the sparse array initialization trick discussed by J. Bentley (CACM, 8/83), we can associate a bit-vector with the large code table, and save time by setting a bit in this map before each read access to the table, and by testing the bit before each write. The macros provided in M. D. McIlroy's now defunct V7 "spell" code, can be used for these bit tests: #define SHIFT 5 #define get(h) (bits[h>>SHIFT]&(1<<((long)h&((1<>SHIFT] |= 1<<((long)h&((1<