Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site fortune.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!mhuxl!ihnp4!fortune!rpw3
From: rpw3@fortune.UUCP
Newsgroups: net.arch
Subject: Re: byte alignment - (nf)
Message-ID: <3232@fortune.UUCP>
Date: Fri, 4-May-84 07:02:20 EDT
Article-I.D.: fortune.3232
Posted: Fri May  4 07:02:20 1984
Date-Received: Sat, 5-May-84 01:33:08 EDT
Sender: notes@fortune.UUCP
Organization: Fortune Systems, Redwood City, CA
Lines: 57

#R:wateng:-96500:fortune:16500013:000:2796
fortune!rpw3    May  4 01:35:00 1984

Speaking of "Stretch" and "bytes", don't forget the venerable PDP-10 and
DECsystem-20 have the same sort of "bytes" -- 1 to 36 bits wide, specified in
the "byte pointer". (I suspect the idea was copied from the IBM predecessors,
since a PDP-10 looks a lot like a cleaned-up IBM 7094.)

The nice thing about PDP-10/20 byte pointers is that they can directly
implement character set mapping via "byte strips". If you partition
the ASCII set into (say) 3 sets (for example, letters/numbers, delimiters,
and "ignore"), you can build a byte pointer that points to a two-bit wide
strip in a 128-word array (or 129 or 256 or 257, depending on taste).
This is a generalization of the trick used in the 'stdio' array "ctype",
but "ctype" uses only single-bit strips. (See CACM, long ago, Jim Gimpell (sp?)
of Bell Labs, "Minimization of Spatially-Multiplexed Character Sets", which
talked about the bit-strips [NOT byte-strips!] used in the SNOBOL-IV lexical
analyzer.)

To make it clearer (?), suppose the character array is called CTYPE (why not?),
and the set we are interested is called LETDEL (6-char limit, on the -10),
and it occupies bits 31 and 32 (the -10 is a "big Endian" machine, so
bit 0 is MSB and bit 35 is LSB), and suppose we always have the character
to be tested in accumulater (register) "C" (programmer-defined symbol),
then the code to do a three-way branch based of which subset the character
belongs to is the following two instructions (I have used UNIX-like syntax
for non-"10" readers):

+--
| 	ldb	t1,letdel	; get two-bit field into register t1
| 	jrst	@.+1(t1)	; dispatch via jump-table
| 	.word	is_ign		;   C is in the "ignore" set, go to is_ign
| 	.word	is_let		;   C is in the "letters & numbers" set
| 	.word	is_del		;   C is in the "delimeters" set
| 	.word	panic		;   can never happen if we built "ctype" right!
| 
|letdel: point	2,ctype(C),32	; same as 32<<30 + 2<<24 + C<<18 + ctype
+--

Using such byte strips (which are only marginally less efficient even on
hardware without variable-sized bytes), VERY FAST lexical analyzers can be
built, fast enough that character-at-a-time interpretation of interactive
languages (such as FOCAL or *shudder* BASIC) is not unreasonable. In a port
of FOCAL to the PDP-10 which used this, the parsing was NOT the critical
time item (symbol table search was).

This is a case in which efficient HARDWARE architecture on an early machine
can suggest efficient SOFTWARE on later machines, even though they may not
have the special hardware of the earlier machine. For example, the "bit-blt"
hardware of the Xerox "Alto" and the software of the Apple "Mac".

Rob Warnock

UUCP:	{ihnp4,ucbvax!amd70,hpda,harpo,sri-unix,allegra}!fortune!rpw3
DDD:	(415)595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphin Drive, Redwood City, CA 94065