Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!imagen!atari!portal!cup.portal.com!Robert_Bob_Freed
From: Robert_Bob_Freed@cup.portal.com
Newsgroups: rec.games.misc,comp.sys.apple
Subject: Re: zork decoding
Message-ID: <1425@cup.portal.com>
Date: Thu, 12-Nov-87 20:20:09 EST
Article-I.D.: cup.1425
Posted: Thu Nov 12 20:20:09 1987
Date-Received: Sun, 15-Nov-87 11:55:41 EST
References: <2804@batcomputer.tn.cornell.edu> <7639@reed.UUCP>
Organization: The Portal System (TM)
Lines: 174
Xref: mnetor rec.games.misc:1153 comp.sys.apple:3294
XPortal-User-Id: 1.1001.2659

In article <2804@batcomputer.tn.cornell.edu>,
saponara@batcomputer.tn.cornell.edu (John Saponara) writes:
> A few weeks ago someone asked how to decode Infocom's "Zork" data...

Attention Zork addicts and dedicated Infocommies:

This posting has touched off a barrage of speculative replies and vague
recollections, but no definitive answers as to how Infocom encodes
textual data in its game files. Considering the amount of apparent
interest in this topic, here are the EXACT details, both general and
specific. I hope this will lay this subject to rest (if such is ever
possible on the net).

General:

Infocom text is not encrypted. Rather, it is coded using a packing
scheme that results in a high degree of data compression. (Although,
considering the difficulty people have found in decoding text, the
scheme functions as a pretty fair encryption mechanism also.)

Text is encoded using a 5-bit, 3-level code. Although the 5-bit code
permits only 32 characters, 26 of these are multiplexed three ways by
means of two "shift" codes, resulting in 78 code positions. These
cover the 52 letters of the upper/lower-case alphabet, 10 digits, and
15 common punctuation characters, including New Line. The one
remaining position is used as an escape prefix for a 3-code sequence
to specify any arbitrary 7-bit ASCII character not included in the set.

Four codes are common to all three levels. One of these is a space
character. The remaining three codes are prefixes for a 2-code token,
which references one of 96 common text substrings. These substrings
are packed using the same encoding scheme and are referenced by means
of a pointer table, the address of which is at a fixed location in the
game file. The particular set of 96 token substrings is game-dependent
and is chosen automatically by the Infocom game language compiler for
maximum overall compression, via analysis of all text strings in a
game. Output of a token substring involves a single level of recursion
by the text string output processor.

Text strings are packed into 16-bit words, with three 5-bit codes per
word. The extra bit is used as an end-of-string flag.

This same packing scheme is used for ALL text, including both game
output and the vocabulary table used to match user input. The same
scheme is used by ALL Infocom games, including the newer Interactive
Fiction "Plus" games, such as Bureaucracy and Beyond Zork. I have
personally disassembled many versions of the Infocom interpreter
program for several different computer systems, and I can thus verify
the accuracy of this description.

Note for would-be game hackers: The interpreter programs (e.g. the
.COM files provided for CP/M and MS-DOS systems) are game-independent.
All game-specific text is contained within the game data (e.g. .DAT)
files, and is typically processed "on-the-fly" by the interpreter
program. Thus, you cannot "cheat" by disassembling the interpreter
itself, and examining memory while the interpreter is running will
yield at most the last (buffered) line of output text in ASCII code.

Specific:

Infocom game files consist of an ordered, addressable sequence of
8-bit bytes (maximum 128K bytes in the "classic" games, 256K bytes in
the newer "plus" games). Text strings are packed into 16-bit words,
with the most significant byte first (lower address) in the game file.
Text strings may (and do) start at arbitrary (odd or even) byte
addresses. The msb of each word is 0 in each intermediate word of a
string, 1 in the final word. This is followed by three 5-bit codes,
which are processed in left-to-right order. I.e., numbering bits
15-to-0 from msb-to-lsb:

     Bit  15    = end-of-string flag
     Bits 14-10 = code 1
     Bits 9-5   = code 2
     Bits 4-0   = code 3

Codes are interpreted at one of three levels, which we denote 0, 1, 2:

     Level 0 = lower-case alphabet
     Level 1 = upper-case alphabet
     Level 2 = digits and punctuation

Level 0 is the default (initial) level at the start of a string, and
at the start of a tokenized substring (processed recursively from the
main string).

Codes 0-3 are common to all levels, and do not affect the current
level:

     Code 0 = space character
     Code 1 = prefix for token substrings 0-31  (next code)
     Code 2 = prefix for token substrings 32-63 (next code + 32)
     Code 3 = prefix for token substrings 64-95 (next code + 64)

The token substrings are normally stored starting at byte address 40
hex in the game file, but these are properly accessed via a 96-word
pointer table whose address is contained in the word at bytes 18-19
hex. All token substrings begin on even byte addresses, and the
pointer table contains the substring word addresses (i.e. byte
addresses divided by two). All 16-bit addresses and pointers are
stored high-byte-first.

Codes 4-5 are used to shift levels:

     Code 4, level 0 = shift to level 1 for next code only
     Code 5, level 0 = shift to level 2 for next code only
     Code 4, level 1 = permanent shift to level 1
     Code 5, level 1 = permanent shift back to level 0
     Code 4, level 2 = permanent shift back to level 0
     Code 5, level 2 = permanent shift to level 2

Note that, from level 0, the shift codes affect the NEXT code only.
(Code 4 normally precedes the first, capitalized word of a sentence.)
Two identical shift codes in a row effect a shift-lock to a new level,
and the alternate shift code is then used to restore level 0.
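
The word layout and the token pointer table described above can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions, not code from any Infocom interpreter: the names
unpack_codes and token_address are made up here, and data is assumed
to hold the entire game file as a bytes object.

```python
def unpack_codes(data, addr):
    """Return the 5-bit codes of the packed string at byte address addr."""
    codes = []
    while True:
        # Each 16-bit word is stored high byte first (lower address).
        word = (data[addr] << 8) | data[addr + 1]
        codes.append((word >> 10) & 0x1F)   # bits 14-10 = code 1
        codes.append((word >> 5) & 0x1F)    # bits 9-5   = code 2
        codes.append(word & 0x1F)           # bits 4-0   = code 3
        addr += 2
        if word & 0x8000:                   # bit 15 set marks the final word
            return codes


def token_address(data, n):
    """Return the byte address of token substring n (0-95)."""
    # The word at bytes 18-19 hex holds the address of the pointer table.
    table = (data[0x18] << 8) | data[0x19]
    entry = table + 2 * n
    # Table entries are word addresses, so double them to get bytes.
    return 2 * ((data[entry] << 8) | data[entry + 1])
```

For example, the two bytes B5 C5 hex form a single final word holding
codes 13, 14, 5: "h", "i", and one code 5 of end padding.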
Code 5 is also used to end-pad the last word of a text string which
does not contain a multiple of three codes.

Finally, codes 6-31 generate printable characters:

     Code   Level 0   Level 1   Level 2
     ----   -------   -------   -------
       6       a         A      (see below)
       7       b         B      New Line
       8       c         C         0
       9       d         D         1
      10       e         E         2
      11       f         F         3
      12       g         G         4
      13       h         H         5
      14       i         I         6
      15       j         J         7
      16       k         K         8
      17       l         L         9
      18       m         M         .
      19       n         N         ,
      20       o         O         !
      21       p         P         ?
      22       q         Q         _
      23       r         R         #
      24       s         S         '
      25       t         T         "
      26       u         U         /
      27       v         V         \
      28       w         W         -
      29       x         X         :
      30       y         Y         (
      31       z         Z         )

Code 6 at level 2 is a special escape prefix, which may be used to
generate an arbitrary ASCII code via a 3-code sequence. The ASCII code
is specified by the two codes that follow the prefix, which contain
the high two bits and low five bits, respectively, of the desired
7-bit ASCII code.

Note that SOME text output is generated by single ASCII codes, which
are not packed using the scheme described here. However, the majority
of all textual data, including the input vocabulary, is packed. (I'll
reserve a description of how to locate the input vocabulary table for
a later posting, if there is sufficient interest in same.)

In conclusion:

I hope all this satisfies some curiosity, particularly from the
standpoint of understanding an interesting and effective text
compression technique. I personally fail to see how anyone could enjoy
an Infocom story by using this information to (partially) decode the
game data file. But to each his own. Happy adventuring!

-- Bob Freed
Internet: Robert_Freed@cup.portal.com
Uucp: sun!portal!cup.portal.com!Robert_Freed
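
Putting the shift rules, the code table, and the escape sequence above
together, a decoder might look like the following Python sketch. It is
an illustration only, not the interpreter's actual routine. The names
decode, word_at, and LEVEL2 are hypothetical, data is assumed to hold
the whole game file as a bytes object, and two behaviors are
assumptions not stated in the description: a pending one-shot shift is
treated as consumed by whichever code follows it, and a token prefix
found inside a token substring is rejected as an error.

```python
LEVEL2 = "\n0123456789.,!?_#'\"/\\-:()"   # level 2 characters for codes 7-31


def word_at(data, addr):
    """Read a 16-bit word stored high byte first."""
    return (data[addr] << 8) | data[addr + 1]


def decode(data, addr, in_token=False):
    """Decode the packed string at byte address addr into a str."""
    codes = []
    while True:                       # unpack three 5-bit codes per word
        word = word_at(data, addr)
        codes += [(word >> 10) & 31, (word >> 5) & 31, word & 31]
        addr += 2
        if word & 0x8000:             # end-of-string flag
            break

    out = []
    lock = 0       # level in force when no one-shot shift is pending
    temp = None    # one-shot level for the next code, or None
    i = 0
    while i < len(codes):
        c = codes[i]
        i += 1
        level = lock if temp is None else temp
        if c in (4, 5):               # code 4 selects level 1, code 5 level 2
            if level == 0:
                temp = c - 3          # shift for the next code only
            else:
                # Same shift code again locks the level; the other restores 0.
                lock = level if c - 3 == level else 0
                temp = None
            continue
        temp = None                   # assumption: any other code uses up the shift
        if c == 0:
            out.append(" ")
        elif c <= 3:                  # token prefix: next code picks the substring
            if i >= len(codes):
                break
            n = 32 * (c - 1) + codes[i]
            i += 1
            if in_token:              # assumption: only one level of recursion
                raise ValueError("token prefix inside a token substring")
            entry = word_at(data, 0x18) + 2 * n
            out.append(decode(data, 2 * word_at(data, entry), in_token=True))
        elif level == 2 and c == 6:   # escape: high 2 bits, then low 5 bits
            if i + 1 >= len(codes):
                break
            out.append(chr(((codes[i] & 3) << 5) | codes[i + 1]))
            i += 2
        elif level == 2:
            out.append(LEVEL2[c - 7])
        else:
            base = ord("a") if level == 0 else ord("A")
            out.append(chr(base + c - 6))
    return "".join(out)
```

As a check on the bit layout, the four bytes 11 AA C6 34 hex unpack to
codes 4, 13, 10, 17, 17, 20, which this sketch decodes as "Hello".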