Path: utzoo!utgpu!jarvis.csri.toronto.edu!dgp.toronto.edu!hugh Newsgroups: comp.os.minix From: hugh@dgp.toronto.edu (D. Hugh Redelmeier) Subject: compression Message-ID: <1989Aug20.014222.10314@jarvis.csri.toronto.edu> Organization: University of Toronto, CSRI Date: 20 Aug 89 05:42:22 GMT | From: ast@cs.vu.nl (Andy Tanenbaum) | Subject: compression | Message-ID: <2888@ast.cs.vu.nl> | | I have gone through the net postings and found a number of useful programs. | I have this fear that the next distribution is going to take 20 disks and cost | $149. I have also noticed that compress only seems to win a factor of 2. | I wonder if better compression of C programs is possible. In particular, | suppose we had a two pass compression program. Pass 1 collected all the | strings in the program and their frequencies. Pass 2 replaced the most | ... Over a decade ago I wrote a Huffman compression program for UNIX. It worked on "tokens" rather than characters. I put the program in the public domain, but few people noticed (there was no USENET then). Since "compress" is the standard with which everyone is familiar, I will contrast my program "huf" with compress. - huf often compresses a little bit more than 16-bit compress. huf is quite a bit better than 12-bit compress. 39614 compress.c [my copy of compress] 22842 compress.c.a [arced with arc 5.21] 21826 compress.c.Z12 [compressed with -b12] 19871 compress.c.Z13 [compressed with -b13] 19141 compress.c.Z16 [compressed without -b] 17996 compress.cH [huffed] And since turnabout is fair play: 41053 huf.c 21693 huf.c.a 22713 huf.c.Z12 20913 huf.c.Z13 19610 huf.c.Z16 18047 huf.cH - huf was intended to compress source programs and natural language (English) documents. It only accepts ASCII text (no characters above 0177). It cannot handle texts with a large number of long distinct tokens (which is what uuencode output looks like). - huf runs in 64k+64k space (UNIX only ran on PDP-11s when it was written). In fact, it can be happy in much less. - huf is slower on compression, but faster on decompression (last time I checked). - compress is much more widely accepted, an important virtue. Huf has been very stable. The format of compressed files has never changed. I have never lost data because of bugs in it (but there was an obscure bug with such a potential discovered in 1986). Huf works on V7, 4.2BSD, and SystemVr2. I presume it would work on MINIX. Files are not affected by byte order, but it is assumed that bytes are eight bits. I think that two's complement arithmetic is required. Here are some ideas for improvements. I am not planning to make these changes but welcome anyone else who would like to. If huf becomes widespread, changing the format of the compressed file becomes a large logistics problem. - In the dictionary, characters are represented in a 6-bit code. A few percent improvement could be achieved by encoding these characters in a one-character-per-symbol Huffman code. This would also make it cheap to allow non-ASCII characters in the text. - With some care, huf could handle currently-pathological texts. If the lexical analysis module noticed a problem (non-ascii text or an excessive number of distinct tokens) it could switch to another mode that would parse the text into tokens differently (say, single-byte tokens). - One major oportunity for compression is external fragmentation. In UNIX, a file is stored in an integral number of blocks (or fragments, in BSD). The last block, on average, is less than half full. This is especially bad for short files. This can be eliminated if the compression program also creates a more compact simulation of a file system. arc does this. tar does not -- it uses more blocks than the UNIX file system. Would MINIX ar do this? - The hashing function could be sped up. A new hash function would not affect the portability of compressed files, but the code implementing the hash function might not be portable. (Multiplies were relatively cheap on the machines on which huf was developed.) - Since the hashing function multiplies two ints, it is probably particularly slow on Atari ST MINIX. If the compiler knows how to do short*short with a single machine instruction, changing the hashing function to use shorts should make it run much faster. Putting HUF up on MINIX ======================= - Since I don't have MINIX, Mark Moraes very kindly tested huf on his ST MINIX. This testing has not been extensive. - huf.c uses varargs.h. This is not part of older versions of MINIX. I have included the one distributed by AST in the newsgroup as part of the Version 1.4 upgrade. - To configure huf.c, give the flag -DV7_16 to the compilation of huf.c Mark says -O worked for him. A chmem will be needed: Mark found huf could compress itself with a chmem +500; more would be wise. -DV7_16 specifies that huf is being compiled for and on a Version 7 UNIX system with 16-bit ints and pointers. MINIX is close enough to Version 7 for this purpose. "16-bit ints and pointers" is correct for MINIX on a PC-clone. It is a useful lie on the ST: it should cause the hashing multiply to be a little faster, without causing any problems. - I don't know how MINIX man-pages are distributed. I am including a UNIX man-page. If can be formatted on unix using: [nt]roff -man huf.1 Hugh Redelmeier {utcsri, yunexus, uunet!attcan, utzoo, hcr}!redvax!hugh When all else fails: hugh@csri.toronto.edu +1 416 482-8253 #!/bin/sh # To unbundle, sh this file echo huf.1 1>&2 sed 's/^-//' >huf.1 <<'!' -.TH HUF 1 Local -.SH NAME huf \- encode and decode Huffman coded text files -.SH SYNOPSIS -.B huf -[ -\fB\-npuvw\fIn\fR -] [ name ] ... -.SH DESCRIPTION -.IR Huf is a program for compressing ASCII text files. Text compression depends on the repetition of -\fIwords\fP, that is, strings of alphanumeric characters. On average, a compressed file is about half the size of the original. The argument list of this command is a sequence of file names and flags. -.PP A file whose name ends in ``H'' is decoded into a file with the same path name, minus the ``H'' (unless the -``\fBp\fP'' option is used). -.PP A file whose name does not end in ``H'' is encoded into a file with the same path name, plus an ``H'' -(unless the ``\fBp\fP'' option is used). The encoding process takes two passes, requiring that the file not be a device or pipe. The file must not be modified while being encoded. -.PP Several attributes of the newly created file are set to match those of the file it was derived from: owner, group, permissions, modification time, and access time. -.PP A flag operand starts with ``\-'' and continues with any number of option names. Each option name (except \fBw\fP) complements the current setting of the corresponding option. Each option is initially off. -.TP 8 -.B \-n -(for noisy) option causes various statistics to be printed on the diagnostic output. -.TP -.B \-p causes the output to be printed on the standard output (even if this output is encoded). -.TP -.B \-u causes the input file to be unlinked. If any error is detected(!) during translation, the file is not unlinked. -.TP -.B \-v causes the output file to be verified: the reverse transformation is performed and compared with the input. Verification has not been necessary. -.TP -.BI \-w n sets the upper bound on the number of unique words allowed in a file to be encoded. If this bound is exceeded, the amount of compression is reduced. This bound is set to the smallest prime number greater than -\fIn\fP that is known to the program. The default bound is about 3000; the maximum is about 7000. -.SH "LOCAL INFO" Written at the University of Toronto by D. Hugh Redelmeier. ! echo varargs.h 1>&2 sed 's/^-//' >varargs.h <<'!' -/* varargs.h */ typedef char *va_list; -#define va_dcl int va_alist; -#define va_start(p) (p) = (va_list) &va_alist; -#define va_arg(p,type) ( (type *) ((p)+=sizeof(type)) )[-1] -#define va_end(p) -#define vfprintf _doprintf -#define vprintf(fmt,args) vfprintf(stdout,fmt,args) ! echo huf.c 1>&2 sed 's/^-//' >huf.c <<'!' -/* Huf - encode and decode Huffman coded text files. - * - * D. Hugh Redelmeier 1978 June 24 - * - * To Compile (See "Portability Issues" below): - * - * cc -s -O -i -z -Dsystem % - * -s: strip - * -O: optimize - * -i: split I and D on PDP-11 etc. - * -z: trap references through NULL on System Vr2 - * -Dsystem: chose configuration (CUSTOM, V7_16, V7_32, BSD_32, or SYSV_32) - * - * To Execute: - * - * The arguments to this command are a sequence of file names - * and flags. These aruments are processed left to right. - * - * A flag starts with "-" and continues with any number of - * options. - * - * The "n" (for noisy) option causes various statistics to - * be printed on the diagnostic output. - * The flag complements the option which is initially off. - * - * The "l" (for local) option causes the output file to - * be placed in the current directory (with the appropriate - * filename). The flag complements the option which is - * initially off. - * - * The "p" option causes the output to be Printed on the - * standard output (even if this output is encoded). - * The flag complements the option which is initially off. - * - * The "u" option causes the input file to be Unlinked. - * If any error is detected(!) during translation, the file - * is not unlinked. For safety reasons, this option may not - * be use with the print option. The flag complements the - * option which is initially off. - * - * The "v" option causes the new file to be verified after - * creation. The verification occurs before any unlink operation - * implied by the "u" option. Verification is not be possible - * with the "p" option. Verification has never proven - * necessary with working hardware. The flag complements - * the option which is initially off. - * - * The "w" option sets the upper bound on the - * number of unique Words allowed in a file to be encoded. - * To reduce hashing collisions, ten percent more words than - * specified are reserved but not used. The resulting number - * is moved up to the nearest prime known to the program. - * - * The "d" (for dump) option causes huf to abort when an error - * is detected. - * - * A file whose name ends in is decoded into a file - * with the same path name, minus the (unless the - * "p" option is used). - * - * A file whose name does not end in is encoded - * into a file with the same path name, plus - * (unless the "p" option is used). The encoding process - * takes two passes, requiring that the file not be a device. - * The file must not be modified while being encoded. - * - * Structure of encoded file: - * - * bits(byteLen) magicNum&byteMask, (magicNum>>byteLen)&byteMask; - * bits(nwLen) numWords; - * bits(specLen) numSpecs-1; - * bits(specLen) specs[numSpecs]; // in frequency order - * wc words[numWords]; // in frequency order - * bit spec_tree_building_decisions[]; - * bit word_tree_building_decisions[]; - * bit starts_with_word?; - * bit hufman_code[]; - */ -/* Modification History: - * - * 1981?; DHR: Ported to V7 on a VAX (much wailing & gnashing of bits). - * 1984 Sept; DHR: Copy modtime of original file to derivative. - * 1984 Nov; DHR: Make overflow symbols into solitaries. - * Added "leap" to speed decoding. - * Improved the hash function and avoided clustering. - * 1984 Dec; DHR: Skip bad files (instead of quitting). - * Try to copy owner and group of original file. - * 1986 Oct 31; DHR: Fix off-by-one bug in lex() affecting tabs. - * The effect of this bug is to turn certain newlines into - * '`' (if the preceding newline was followed by more than - * 12 HTabs, and this newline is followed by none). In a - * similar circumstance, semicolon followed by newline would - * be turned into '@'). - * Sped up hashing by eliminating one multiply per word. - * Added the l option. - * Added #ifdef for VMSUNITY. - * 1987 May 12; DHR: Improve portability: - * Use varargs.h for errf(). - * Allow malloc to substitute for brk/sbrk (but not well). - * Declare procedures as returning void. - * Cut down number of write() calls from errf(). - * 1987 September 27; DHR: Make ANSI compatible - * Yet another change to the diagnostic stuff: - * - diagfmt() formats a message - * - diagpr() prints in on stderr - * - error() no longer takes a variable number of args; - * if its arg is NULL, it finds the message in the diagnostic - * buffer (formatted by diagfmt(), but not yet printed). - * - added function prototypes. - * 1989 August 14; DHR: Make squeaky clean. - * - Conditionally use ANSI's stdarg.h. - * - added "static" to all non-exports. - * - A few more things are "unsigned". - * - A few more coecions are explicit. - * - Parameters of compar() are "pointer" (not "tree *") - * - Fixed bug in error(). - * - Avoid MINIX C bug [would not accept x++ -> y]. - * - Tinkered with formatting. - */ -/* Portability Issues: - * - * Memory allocation can be done using sbrk() or malloc(). - * sbrk() is the more efficient, and the original technique. - * malloc() has been hacked in recently. To use malloc(), - * define the symbol USEMALLOC. The version using malloc() - * is quite crude and wasteful: it allocates a maximum - * amount of memory and aborts if this is not enough. - * The amount allocated is the value of the symbol USEMALLOC! - * - * If memory allocation is done using sbrk(), no mallocs may - * be used, even by library routines (hence errf() is defined - * and used instead of fprintf(stderr, ...): stdio is not safe). - * - * The overlaying of various memory structures is dicey, but - * it is probably portable. - * - * The size of an int must be at least twice the size of a byte - * and greater than nwLen. - * - * "hashMult" is sensitive to hardware wordlength, and will be - * wrong unless the machine is a byte machine with exactly 32-bit - * longs. This will not be a disaster. - * - * "bufSize" should be a multiple of the file system block size: - * bigger is faster, but wastes space. - * - * The value of nwLen reflects the size of the PDP-11 address - * space. To enable larger address spaces to be exploited, - * it should be increased (larger entries would have to be added - * to primes[]). Changing nwLen would make old encoded files - * incompatible so the limit has not been changed. - * - * The format of the encoded file should be universal unless - * one of the constants mentioned in the format description - * is changed. - * - * Under System V, we define a dummy routine "_cleanup". - * This maybe wrong for some systems. See the explanation - * where _cleanup is defined. - * - * Berkeley renames utime(2) as utimes(2). If running - * under Berkeley, define the preprocessor symbol "BSD". - * - * Under Berkeley 4.2, file name components of path names are - * not limited in length. Define the symbol LONGFILENAMES - * - * "HZ", the clock frequency, must be a multiple of 10. - * Otherwise, the times reported by noisy mode will be funny - * (big deal). - * - * VMS/Unity is H.C.R.'s emulation of System V under DEC's - * VAX/VMS system. It does not support chown() and utime(). - */ -/* The following include is needed for . - * It is included early because (under BSD) it defines size_t. - */ -#include -/* ---------------- Custom Configuration ---------------- */ -#ifdef CUSTOM -/* #define BSD 1 // defined iff running under Berkeley */ -/* #define LONGFILENAMES 1 // defined iff no bound on filename */ -/* #define SYSV 1 // defined iff running under System V */ -/* #define VMSUNITY 1 // defined iff running under VMSUNITY */ -/* #define USEMALLOC 65534 // for systems on which sbrk() doesn't work */ -/* typedef int void; // needed for older compilers (e.g. V7) */ -/* typedef unsigned int size_t; // type of result of sizeof */ -/* hashMult is the inverse golden ratio scaled by the word length. - * See Knuth Volume 3, 6.4. - * 1327217885 == ((sqrt(5)-1)/2)*(2**31) | 1 - */ -/* #define hashMult 20251 // 16-bit int */ -/* #define hashMult 1327217885 // 32-bit int */ -#define HZ 60 /* clock frequency */ -#define bufSize 1024 /* bigger is faster; no other effect */ -#endif /* Custom Configuration */ -/* ---------------- V7, 16-bit int and char * ---------------- */ -#ifdef V7_16 typedef int void; /* needed for older compilers (e.g. V7) */ typedef unsigned int size_t; /* type of result of sizeof */ -/* hashMult is the inverse golden ratio scaled by the word length. - * See Knuth Volume 3, 6.4. - * 1327217885 == ((sqrt(5)-1)/2)*(2**31) | 1 - */ -#define hashMult 20251 /* 16-bit int */ -#define HZ 60 /* clock frequency */ -#define bufSize 1024 /* bigger is faster; no other effect */ -#endif /* V7 on 16-bit machine */ -/* ---------------- V7, 32-bit int and char * ---------------- */ -#ifdef V7_32 typedef int void; /* needed for older compilers (e.g. V7) */ typedef unsigned int size_t; /* type of result of sizeof */ -/* hashMult is the inverse golden ratio scaled by the word length. - * See Knuth Volume 3, 6.4. - * 1327217885 == ((sqrt(5)-1)/2)*(2**31) | 1 - */ -#define hashMult 1327217885 /* 32-bit int */ -#define HZ 60 /* clock frequency */ -#define bufSize 1024 /* bigger is faster; no other effect */ -#endif /* V7 on 32-bit machine */ -/* ---------------- BSD, 32-bit int and char * ---------------- */ -#ifdef BSD_32 -#define BSD 1 /* defined iff running under Berkeley */ -#define LONGFILENAMES 1 /* defined iff no bound on filename */ -/* size_t defined in */ -/* hashMult is the inverse golden ratio scaled by the word length. - * See Knuth Volume 3, 6.4. - * 1327217885 == ((sqrt(5)-1)/2)*(2**31) | 1 - */ -#define hashMult 1327217885 /* 32-bit int */ -#define HZ 100 /* clock frequency */ -#define bufSize 1024 /* bigger is faster; no other effect */ -#endif /* BSD version */ -/* ---------------- System V, 32-bit int and char * ---------------- */ -#ifdef SYSV_32 -#define SYSV 1 /* defined iff running under System V */ typedef unsigned int size_t; /* type of result of sizeof */ -/* hashMult is the inverse golden ratio scaled by the word length. - * See Knuth Volume 3, 6.4. - * 1327217885 == ((sqrt(5)-1)/2)*(2**31) | 1 - */ -#define hashMult 1327217885 /* 32-bit int */ -#define HZ 100 /* clock frequency */ -#define bufSize 1024 /* bigger is faster; no other effect */ -#endif /* SysV on VAX version */ -/* end of system specific definitions */ -#define null 0 -#ifdef __STDC__ -#define proto(p) p /* use prototype information */ typedef void *pointer; /* universal pointer type */ -#else /*!__STDC__*/ -#define proto(p) () /* ignore prototype information */ typedef char *pointer; /* universal pointer type */ -#endif /*!__STDC__*/ -/* A node pointer is a pointer to one of: - * - an Eleaf (an encoding leaf) - * - a Dleaf (a decoding leaf -- a character string) - * - an intrn (an internal node) - * - * The way these are discriminated between is by where the pointer points. - * This is not a union of pointers (because no matter which is being - * pointed to, the pointer representation must be identical so compares - * work). Nor is it a pointer to a union (because a union would be - * aligned on a boundary suitable for its most-aligned component, and - * this might change the pointer representation). It must be represented - * as a universal pointer, and cast when it must be used as a pointer - * to a particular kind. - */ typedef pointer nodePtr; typedef struct { /* Huffman tree internal node */ - nodePtr left; - nodePtr right; -} intrn; typedef struct { /* Huffman tree root */ - unsigned Tcount; /* Tree frequency count */ - nodePtr root; /* top node */ -} tree; /* must be no bigger than Eleaf */ typedef union { /* Huffman Encoding tree leaf */ - unsigned Lcount; /* Leaf frequency count */ - long code; /* Huffman code for this leaf */ - tree FUDGE3; /* force size requirement */ -} Eleaf; /* must be as big as tree */ typedef char Dleaf; /* decoding leaf: a flag-terminated string */ -#define numAscii 128 -#define asciiMask (numAscii-1) -#define aFlag 0200 -#define perNl '0' -#define commaSp '1' -#define commaNl '2' -#define eof '9' static nodePtr eofPtr; /* Eleaf or Dleaf for eof */ -#define maxTabs (('z'-'a')/2) -#define nlTabs ('a'+maxTabs) -#define scNlTabs ('A'+maxTabs) static int curTabs; -#define numWc 64 -#define wcLen 6 -#define wcMask (numWc-1) static char wcToA[numWc]; static char aToWc[numAscii]; static void initWordCode proto((void)); -#define wfBias numAscii -#define maxSpecs (numAscii+wfBias) -#define specLen 8 -#define specMask (maxSpecs-1) -#define solitary maxSpecs static nodePtr solPtr; /* Eleaf or Dleaf for solitary */ -#define byteLen 8 -#define byteMask ((1< static jmp_buf skipFileJmpBuf; static int retCode = 0; static void encode proto((void)); static void initLex proto((void)); static void censusPass proto((void)); static void encWord proto((Eleaf *w)); static void assignCodes proto((nodePtr nodeArg, long c, unsigned cl)); static void decode proto((void)); -#include -#ifdef SYSV extern long times proto((struct tms *)); -#else /*!SYSV*/ -#ifdef BSD extern long times proto((struct tms *)); -#else /*!BSD*/ extern void times proto((struct tms *)); /* V7 is in the minority */ -#endif /*!BSD*/ -#endif /*!SYSV*/ static struct tms tbuffer; -#ifdef BSD -#define utimes utime -#endif struct utimbuf { /* under V7 & BSD: should be an array */ - time_t actime; /* access time */ - time_t modtime; /* modification time */ -}; extern int utime proto((char *, struct utimbuf *)); static struct utimbuf ftimes; -#include extern int stat proto((char *, struct stat *)); extern int fstat proto((int, struct stat *)); extern int chown proto((char *, short, short)); static struct stat statbuf; -#define hashMask 0777 /* smaller than every prime[] */ static unsigned primes[]={ 607, 947, 1109, 1511, - 1913, 2393, 2707, 3121, - 3529, 3911, 4363, 4721, - 5167, 5521, 5939, 6311, - 6719, 7121, 7541, 0 }; int main(argc,argv) - int argc; - char *argv[]; -{ - register char *p; - time_t putime, pstime; - parmsLeft=argc; - curParm=argv; - myName = *curParm; - initWordCode(); -#ifdef USEMALLOC - mem= (char *) malloc(USEMALLOC); - if (mem==null) - error("can't malloc at all"); - memRoof=mem+USEMALLOC; -#else /*!USEMALLOC*/ - memRoof=mem=(char *) sbrk(0); -#endif /*!USEMALLOC*/ -#ifdef SYSV - (void) -#endif - times(&tbuffer); - while (--parmsLeft != 0) { - p = *++curParm; - if (*p=='-') { - while (*++p) { - switch (*p) { - case 'd': - dump = !dump; - break; - case 'n': - noisy = !noisy; - break; - case 'l': - local = !local; - break; - case 'p': - print = !print; - break; - case 'u': - unl = !unl; - break; - case 'v': - verify = !verify; - break; - case 'w': - { - unsigned n; - unsigned *pp; - for (n=0; '0'<= p[1] && p[1]<='9';) - n=n*10+(*++p-'0'); - for (pp=primes; *pp=0) - (void) close(inf); - } else { - inName=pname; - for(verifying=0;verifying<=verify;verifying++) { - inf=open(inName,0); - if (inf<0) - error("+Can't open"); - if (fstat(inf,&statbuf) == -1) - crash("Can't fstat input file"); - if ((statbuf.st_mode&S_IFMT)!=S_IFREG - && (statbuf.st_mode&S_IFMT)!=S_IFBLK) - { - diagfmt("+Invalid file mode (0%o)", statbuf.st_mode&S_IFMT); - error((char *) null); - } - ftimes.modtime = statbuf.st_mtime; - if (noisy) { - diagfmt("%s: ",inName); - diagpr(); - } - outpos=outbuf; - outLen=0; - bits=nxtBit=0; - if (fnEnd!=fname && fnEnd[-1]==suffix) - decode(); - else - encode(); - inLine=0; - putbuf(); - if (fstat(inf,&statbuf) == -1) - crash("Can't fstat input file"); - if (ftimes.modtime != statbuf.st_mtime) - error( - "Input file modified while processing"); - ftimes.actime = statbuf.st_atime; - (void) close(inf); - if (outf!=1) { - (void) close(outf); -#ifndef VMSUNITY - if (!verifying) { - if (utime(outName,&ftimes)!=0) - error("Can't set modification time"); - (void) chown(outName,statbuf.st_uid,statbuf.st_gid); - } -#endif - } - if (noisy) { - diagfmt( - "%D bytes in; %D bytes out.\n", - inLen,outLen); - diagpr(); - putime=tbuffer.tms_utime; - pstime=tbuffer.tms_stime; -#ifdef SYSV - (void) -#endif - times(&tbuffer); - diagfmt( - "%Dds user; %Dds system.\n", - (tbuffer.tms_utime-putime)/(HZ/10), - (tbuffer.tms_stime-pstime)/(HZ/10)); - diagpr(); - } - inName = outName; - } - if (unl) { - if (unlink(*curParm) == -1) - error("Can't unlink"); - } - } - } - } - return retCode; -} static void initWordCode() -{ - register int i; - register int c; - for (c=i=0; i!=numAscii; i++) { - if ('a'<=i && i<='z' || 'A'<=i && i<='Z' - || '0'<=i && i<='9' || i=='_') - { - c++; - if (c>=numWc) - crash("Bad wc"); - wcToA[c]=i; - aToWc[i]=c; - } else { - aToWc[i]=0; - } - } -} -/* - * Input/Output Routines - */ -/* diagfmt() is much like sprintf to DiagMess, however it avoids stdio. - * This avoidance is because stdio takes memory & uses malloc. - */ static void diagpr() -{ - if (DiagPtr static void diagfmt(char *fmt, ...) -{ -#else /*!__STDC__*/ -#include -/*VARARGS1*/ static void diagfmt(va_alist) - va_dcl -{ -#endif /*!__STDC__*/ - va_list argp; - register char *fp; - register char *fnp; - register int b; - register long n; - static char fnum[20]; - register int neg; -#ifdef __STDC__ - fp = fmt; - va_start(argp, fmt); -#else /*!__STDC__*/ - va_start(argp); - fp = va_arg(argp, char *); -#endif /*!__STDC__*/ - DiagPtr = DiagMess; - for (;;) { - b = *fp++; - if (b == '\0') { - break; - } else if (b == '%') { - switch (*fp++) { - case '%': - errc('%'); - continue; - case 'c': - errc(va_arg(argp, int)); - continue; - case 's': - errs(va_arg(argp, char *)); - continue; - case 'o': - b = 8; - n = va_arg(argp, unsigned); - break; - case 'O': - b = 8; - n = va_arg(argp, long); - break; - case 'l': - b = 10; - n = va_arg(argp, unsigned); - break; - case 'd': - b = 10; - n = va_arg(argp, int); - break; - case 'D': - b = 10; - n = va_arg(argp, long); - break; - default: - crash("bad format code"); - } - fnp = &fnum[sizeof fnum]; - *--fnp = '\0'; - if ((neg = (n<0)) != 0) - n = -n; - do { - *--fnp = n%b + '0'; - n /= b; - } while (n != 0); - if (neg) - *--fnp = '-'; - errs(fnp); - } else { - errc(b); - } - } - va_end(argp); -} -/*ARGSUSED*/ static void crash(cause) - char *cause; /* read this with a debugger */ -{ - (void) abort(); - exit(2); /* just in case! */ -} static void error(m) - char *m; -{ - int errSkip; - char mess[MAXDIAG+1]; - if (m == (char *) null) { - register char *from = DiagMess; - register char *to = mess; - m = mess; - while (from != DiagPtr) - *to++ = *from++; - *to = '\0'; - } - if ((errSkip = *m=='+') != 0) - m++; - diagfmt("%s(%s)",myName,*curParm); - diagpr(); - if (inLine) { - diagfmt(" in line %l",inLine); - diagpr(); - } - diagfmt(": %s%s\n",m,errSkip? " -- file skipped." : "."); - diagpr(); - if (dump) - crash("Error."); - if (errSkip) { - retCode = 2; - longjmp(skipFileJmpBuf,1); - } - exit(3); -} static void nextBlock() -{ - register int len; - if (inpos>inend) - error("Unexpected EOF"); - inpos=inend=inbuf; - len=read(inf,inbuf,(size_t) bufSize); - if (len<=0) { - if (len<0) - error("Input I/O error"); - *inend = '\0'; - } - inLen += len; - inend += len; -} -#define needByte() { if (inpos>=inend) nextBlock(); } static void addByte() -{ -#ifndef I8086 /* avoid hardware(!) bug */ - if (bits>>nxtBit != 0) - crash("Bad bits"); -#endif - needByte(); - bits |= (*inpos++ & byteMask) << nxtBit; - nxtBit += byteLen; -} -#define needBits(n) { if(nxtBit<(n)) addByte(); nxtBit -= (n); } static void croutf() -{ - register unsigned mode; - if (verifying) { - outName = pname; - outf = open(outName,0); - } else { - outName = local? fname : pname; - mode=statbuf.st_mode; - if (stat(outName,&statbuf) != -1) { - diagfmt("\"%s\" already exists; I will not overwrite", - outName); - error((char *) null); - } - outf=creat(outName,(int)mode); - } - if (outf == -1) { - diagfmt("Can't create \"%s\"",outName); - error((char *) null); - } -} static void putbuf() -{ - register size_t i; - i = outpos-outbuf; - outLen += i; - if (verifying) { - if (read(outf,verbuf,(size_t) bufSize) != i) - error("Bad length on verify"); - while (i--) - if (verbuf[i] != outbuf[i]) - error("Verification failed"); - } else { - if (write(outf,outbuf,i) != i) - error("Output I/O error"); - } - outpos=outbuf; -} -#define outch(c) { *outpos++ = (c); if(outpos==outbuf+bufSize) putbuf(); } static void gotByte() -{ - outch(bits); - bits >>= byteLen; - nxtBit -= byteLen; -} -#define gotBits(n) { nxtBit += (n); if(nxtBit>=byteLen) gotByte(); } -/* Encoder dynamic memory structures: - * - * Eleaf leaves[maxSyms]; - * leaves, eofPtr, solPtr, leavesRoof. - * tree forest[maxSyms] overlaying leaves; - * (so we require sizeof(tree) <= sizeof(leaf)) - * forest synonym for leaves. - * intrn intrns[maxWords]; - * intrns. - * char *words[maxWords] overlaying start of intrns; - * words. - * char chars[] overlaying end of intrns and extending beyond; - * chars, nxtChar, charRoof (synonym for memRoof). - */ static Eleaf *leaves, *leavesRoof; -#define forest ((tree *) leaves) static char **words; static unsigned wordLimit; static unsigned wordsLeft; static unsigned solCnt; static unsigned oflowCnt, oflowChars; static unsigned maxSyms; static unsigned maxCodeLen; static long tabLen; static long symLookups,hashProbes; static long bytesSpecials,bytesWords; static char *charCeil; /* charRoof-1 */ static Eleaf *lex proto((void)); /* declare forward */ static void encode() -{ - register int i; - pointer memNeed; - register Eleaf *symbol; - register long lbits; - wordLimit=maxWords-maxWords/11; /* consider table full at 90% */ - maxSyms=maxSpecs+1+maxWords; - leaves = (Eleaf *) mem; - solPtr = (nodePtr) &leaves[solitary]; - eofPtr = (nodePtr) &leaves[eof]; - leavesRoof = &leaves[maxSyms]; - intrns = (intrn *) leavesRoof; - memNeed = (pointer) &intrns[maxWords]; - words = (char **) leavesRoof; - chars = (char *) &words[maxWords]; - if (memNeed < (pointer)chars) - memNeed = (pointer)chars; - if ((pointer)memRoof < memNeed) { -#ifdef USEMALLOC - error("out of fixed memory allocation"); -#else /*!USEMALLOC*/ - if (brk(memNeed) == -1) - error("Can't get memory needed"); - memRoof=memNeed; -#endif /*!USEMALLOC*/ - } - charCeil=charRoof-1; - maxCodeLen=0; - hashProbes=0; - symLookups=0; - oflowChars=0; - censusPass(); - initLex(); - bits |= (aToWc[*inpos&asciiMask]!=0) << nxtBit; - gotBits(1); - do { - symbol=lex(); - lbits=symbol->code; - if (lbits == -1L) - lbits=((Eleaf *) solPtr)->code; -#ifdef DEBUG - diagfmt("sym %o outpos %d nxtBit %d bits %o code %O\n", - symbol,outpos-outbuf,nxtBit,bits,lbits); - diagpr(); -#endif /*DEBUG*/ - i=lbits&lenMask; - lbits = lbits>>lenLen<=byteLen) { - outch( (char) lbits ); - lbits >>= byteLen; - nxtBit -= byteLen; - } - bits=lbits; - if ((nodePtr)symbol==solPtr || symbol->code==-1L) - encWord(symbol); - } while ((nodePtr)symbol!=eofPtr); - if (bits!=0) { - outch(bits); - if (bits<0) - crash("Signed bits"); - } - if (noisy) { - diagfmt("%d of %d words containing %d characters.\n", - wordLimit-wordsLeft,maxWords,nxtChar-chars); - diagpr(); - diagfmt("%d solitaries", solCnt-oflowCnt); - diagpr(); - if (oflowCnt != 0) { - diagfmt("; %d words of %d characters overflowed", - oflowCnt, oflowChars); - diagpr(); - } - diagfmt(".\n%D symbol lookups; %D hash probes.\n", - symLookups,hashProbes); - diagpr(); - diagfmt("%D bytes of tables; %d bits in longest code.\n", - tabLen,maxCodeLen); - diagpr(); - diagfmt("%D bytes of coded specials; %D bytes of coded words.\n", - bytesSpecials,bytesWords); - diagpr(); - } -} static long buildTree proto((tree *, tree *, Eleaf*)); static int compar proto((pointer, pointer)); static void censusPass() -{ - register Eleaf *symbol; - tree *wf; /* word forest */ - register tree *wfn; /* word forest next */ - register tree *sfn; /* spec forest next */ - tree *treep; - initLex(); - for (symbol=leaves; symbol!=leavesRoof; symbol++) - symbol->Lcount=0; - do { - symbol=lex(); -#ifdef DEBUG - diagfmt("Symbol %d\n",symbol-leaves); - diagpr(); -#endif /*DEBUG*/ - if (++(symbol->Lcount) <= 0) { - diagfmt("Counter overflow. Symbol 0%o",symbol-leaves); - error((char *) null); - } - } while ((nodePtr)symbol!=eofPtr); - inLine=0; - sfn=forest; - for (symbol=leaves; (nodePtr)symbol!=solPtr; symbol++) { - if (symbol->Lcount!=0) { - sfn->Tcount=symbol->Lcount; - sfn->root = (nodePtr) symbol; - sfn++; - } - } - oflowCnt = solCnt = ((Eleaf *)solPtr)->Lcount; - wf = wfn = (tree *) symbol; - while (++symbol != leavesRoof) { - if (symbol->Lcount != 0) { - if (symbol->Lcount==1) { - if(++solCnt <= 0) - error("Solitary counter overflow."); - } else { - wfn->Tcount=symbol->Lcount; - wfn->root = (nodePtr)symbol; - wfn++; - } - } - } - if (solCnt!=0) { - wfn->Tcount=solCnt; - wfn->root = solPtr; - wfn++; - } - outf=1; - if (!print) { - if (fnEnd>=fname+fnSize) - error("+File name too long"); - *fnEnd++ = suffix; - *fnEnd = 0; - croutf(); - } - outch((char) (magicNum&byteMask)); - outch((char) (magicNum>>byteLen & byteMask)); - bits=wfn-wf; /* nxtBit must be zero or else! */ - if (bits > nwMask) { - diagfmt("Too many unique words: %d", bits); - error((char *) null); - } - gotBits(nwLen); - bits |= (sfn-forest-1) << nxtBit; - gotBits(specLen); - qsort((pointer) forest, (size_t) (sfn-forest), sizeof(tree), compar); - for (treep=forest; treep!=sfn; treep++) { - bits |= ((Eleaf *)treep->root - leaves) << nxtBit; - gotBits(specLen); - } - if (wfn!=wf) { - qsort((pointer) wf, (size_t) (wfn-wf), sizeof(tree), compar); - for (treep=wf; treep!=wfn; treep++) { - if (treep->root != solPtr) { - encWord((Eleaf *) treep->root); - } else { - /* output sol, a zero-length word */ - /* bits |= 0<ofn[-1].Tcount) - crash("Sort failed"); - nxtIntrn=intrns; - bitsCode=0L; - mf=mfn=of=f; - for (treeCount=ofn-f; --treeCount; ) { - register tree *p; - decision = of==ofn || mf==mfn; - if (of==ofn || (mf!=mfn && (mf->Tcount)<(of->Tcount))) { - p=mf++; - decision += 2; - } else { - p=of++; - } - nxtIntrn->left = p->root; - mfn->Tcount = p->Tcount; - if (of==ofn || (mf!=mfn && (mf->Tcount)<(of->Tcount))) { - p=mf++; - decision += 4; - } else { - p=of++; - } - nxtIntrn->right = p->root; - mfn->Tcount += p->Tcount; - mfn->root = (nodePtr) nxtIntrn; - if (mfn->TcountTcount && treeCount!=1) { - diagfmt("Counter overflow. %d trees left", treeCount); - error((char *) null); - } - bitsCode += mfn->Tcount; - if ((decision&1)==0) { - if (decision == 2) { - /* turn into decision=4 */ - nxtIntrn->right = nxtIntrn->left; - nxtIntrn->left = p->root; - decision=4; - } - if (decision==4) { - /* bits |= 0<root; - { - register Eleaf *p; - for (p = (Eleaf *) f; p!=lr; p++) - p->code = -1L; -} - assignCodes(topNode,0L,(unsigned)0); - return bitsCode/byteLen; -} static int compar(p,q) - pointer p; - pointer q; -{ - if (((tree *)p)->Tcount < ((tree *)q)->Tcount) - return -1; - else if (((tree *)p)->Tcount > ((tree *)q)->Tcount) - return 1; - else - return 0; -} static void encWord(w) - Eleaf *w; -{ - register char *p; - if (w == (Eleaf *)solPtr) - p = nxtChar; /* output ephemeral word */ - else - p=words[(w-leaves)-(solitary+1)]; - do { - bits |= aToWc[*p&asciiMask] << nxtBit; - gotBits(wcLen); - } while ((*p++ & aFlag) == 0); - /* bits |= 0<maxCodeLen) { - maxCodeLen=cl; - if (cl>longLen-1-lenLen) - error("A code would be too long"); - } - ((Eleaf *)node)->code = c<left, c, cl+1); - assignCodes(((intrn *) node)->right, c | 1L<=charCeil) { -#ifdef USEMALLOC - error("out of fixed memory allocation"); -#else /*!USEMALLOC*/ - memRoof += charChunk; - if (brk((pointer) memRoof) == -1) - error("Out of memory"); - charCeil=charRoof-1; -#endif /*!USEMALLOC*/ - } - *p++ = c; - needByte(); - c = *inpos++; - if ((c&~asciiMask)!=0 || aToWc[c]==0) - break; - i = i*hashMult+c; /* ignore overflow */ - } - inpos--; - p[-1] |= aFlag; - p[0] = 0; /* force string compare termination */ - endWord=p; - symLookups++; - step = (i&hashMask) + 1; - /* note: signed divide faster than unsigned on VAX */ - i = ((int)(i & ~(-1<<((sizeof i)*byteLen - 1)))) % maxWords; - for (;;) { - hashProbes++; - if ((p=words[i]) == null) { - if (wordsLeft == 0) { - oflowChars += endWord-nxtChar; - return (Eleaf *)solPtr; - } - words[i]=nxtChar; - nxtChar=endWord; - wordsLeft--; - break; - } - if (*p == *nxtChar) { - inp=nxtChar; - do; while(*++p == *++inp); - /* note: *endWord=='\0' but *p cannot */ - if (inp==endWord) - break; - } - i += step; - if (i >= maxWords) - i -= maxWords; - } - i += maxSpecs+1; - } else { - if (inpos>inend) { - /* inpos--; */ - return leaves+eof; /* avoid lookahead */ - } else { - needByte(); - switch (c) { - case ';': - if (*inpos!='\n') - break; - inpos++; - /* FALLTHROUGH */ - case '\n': - inLine++; - for (i=0; ; i++) { - needByte(); - if (*inpos!='\t' || i>=maxTabs) - break; - inpos++; - } - c = (c==';'? scNlTabs : nlTabs)+i-curTabs; - curTabs=i; - break; - case ',': - if (*inpos==' ') { - inpos++; - c=commaSp; - } else if (*inpos=='\n') { - inpos++; - inLine++; - c=commaNl; - } - break; - case '.': - if (*inpos=='\n') { - inpos++; - inLine++; - c=perNl; - } - } - } - needByte(); - i=aToWc[*inpos&asciiMask]!=0? c+wfBias : c; - } - return leaves+i; -} -/* Decoder dynamic memory structures: - * - * intrn intrns[numSpecs+numWords]; - * intrns. - * char chars[num_chars]; - * chars, eofPtr, solPtr, nxtChar, - * charRoof (synonym of memRoof). - */ -#define leapLen 8 struct leap { /* how to leap to right part of decoding tree */ - nodePtr leapTo; - int leapDepth; -}; static struct leap specLeap[1<>= nwLen-byteLen; - needBits(specLen); - numSpecs=(bits&specMask)+1; - bits >>= specLen; - if ((unsigned)c!=magicNum || numSpecs>maxSpecs) - error("Input is not Hufman encoded"); - intrns = (intrn *) mem; - chars = (char *) &intrns[numSpecs+numWords]; - eofPtr = (nodePtr) &chars[eof]; - solPtr = (nodePtr) &chars[solitary]; - nxtChar = (char *)solPtr + 1; - if (memRoofleft = (nodePtr) &chars[bits&specMask]; - bits >>= specLen; - node = (nodePtr) (((intrn *) node) + 1); - } -#ifdef DEBUG - diagfmt("got spec list %d %d %o\n",inpos-inbuf,nxtBit,bits); - diagpr(); -#endif /*DEBUG*/ - for (c=numWords; c!=0; c--) { - ((intrn *) node)->left = (nodePtr) nxtChar; - nxtChar=decWord(); - if ((nodePtr)nxtChar == ((intrn *) node)->left) - ((intrn *) node)->left = solPtr; - node = (nodePtr) (((intrn *) node) + 1); - } -#ifdef DEBUG - diagfmt("got word list %d %d %o\n",inpos-inbuf,nxtBit,bits); - diagpr(); -#endif /*DEBUG*/ - getTree(intrns,intrns+numSpecs,specLeap); -#ifdef DEBUG - diagfmt("got spec tree %d %d %o\n",inpos-inbuf,nxtBit,bits); - diagpr(); -#endif /*DEBUG*/ - if (numWords!=0) - getTree(intrns+numSpecs,(intrn *)node,wordLeap); -#ifdef DEBUG - diagfmt("got word tree %d %d %o\n",inpos-inbuf,nxtBit,bits); - diagpr(); -#endif /*DEBUG*/ - needBits(1); - if ((bits&1)==0) - leap=specLeap; - else - leap=wordLeap; - bits >>= 1; - for (;;) { -#ifdef DEBUG - diagfmt("Node = 0%o\n",node); - diagpr(); -#endif /*DEBUG*/ - c = bits; - if (nxtBit < leapLen) { - needByte(); - c |= (*inpos++ & byteMask) << nxtBit; - nxtBit += byteLen; - } - leap += c & ((1<>= leap->leapDepth; - nxtBit -= leap->leapDepth; - node = leap->leapTo; - while (node<(nodePtr)chars) { - if (--nxtBit<0) { - needByte(); - c = *inpos++ & byteMask; - nxtBit=byteLen-1; - } - if ((c&1)==0) - node=((intrn *) node)->left; - else - node=((intrn *) node)->right; - c >>= 1; - } - bits = c; - if (node >= solPtr) { - if (node == solPtr) { - /* decode the word, but don't keep it */ - (void) decWord(); - node = (nodePtr) nxtChar; - } - for (;;) { - outch(*(Dleaf *)node & asciiMask); - if ((*(Dleaf *)node & aFlag) != 0) - break; - node = (nodePtr) ((Dleaf *) node + 1); - } - leap = specLeap; - } else { - if (node == eofPtr) - break; - c=((Dleaf *)node - chars)&~wfBias; - if (aToWc[c]) { - if (c==perNl) { - outch('.'); - outch('\n'); - } else if (c==commaSp) { - outch(','); - outch(' '); - } else if (c==commaNl) { - outch(','); - outch('\n'); - } else { - if (nlTabs-maxTabs<=c - && c<=nlTabs+maxTabs) { - c -= nlTabs; - } else { - c -= scNlTabs; - outch(';'); - } - outch('\n'); - curTabs = c += curTabs; - while (c--) - outch('\t'); - } - } else { - outch(c); - } - if (node < (nodePtr) (chars+wfBias)) - leap = specLeap; - else - leap = wordLeap; - } - } - if (inposof) - crash("Tree builder"); - if (of!=ofn && mf!=mfn) { - /* both lists are non-empty, so read "decision" */ - needBits(1); - if ((bits&1)==0) { - decision=4; - } else { - bits >>= 1; - needBits(1); - decision = (bits&1)==0? 0 : 6; - } - bits >>= 1; - } - if (of==ofn || (mf!=mfn && (decision&2)!=0)) - p = (nodePtr) mf++; - else - p = (of++) -> left; - mfn->left = p; - if (of==ofn || (mf!=mfn && (decision&4)!=0)) - p = (nodePtr) mf++; - else - p = (of++) -> left; - mfn->right = p; - mfn++; - } - buildLeap(leapTable, 1, of!=ofn? of->left : (nodePtr) mf, 0); -} static void buildLeap(table, bit, nodeArg, depth) - register struct leap *table; - register int bit; - nodePtr nodeArg; - int depth; -{ - register nodePtr node; - register int i; - node = nodeArg; - while (bit!=(1<left, depth); - node = ((intrn *) node)->right; - table += bit; - bit <<= 1; - } - for (i=0; i!=(1<leapTo = node; - table->leapDepth = depth; - table += bit; - } -} static char * decWord() -{ - register char *p; - register char c; - p=nxtChar; - for (;;) { - if (nxtBit < wcLen) { - needByte(); - bits |= (*inpos++ & byteMask) << nxtBit; - nxtBit += byteLen; - } - c=bits&wcMask; - bits >>= wcLen; - nxtBit -= wcLen; - if (c==0) { - if (p!=nxtChar) - p[-1] |= aFlag; - return p; - } - if (p>=charRoof) { -#ifdef USEMALLOC - error("out of fixed memory allocation"); -#else /*!USEMALLOC*/ - memRoof += charChunk; - if (brk((pointer)memRoof) == -1) - error("Out of memory for characters"); -#endif /*!USEMALLOC*/ - } - *p++ = wcToA[c]; - } -} ! echo shar unpacked fully exit