Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!jwmills
From: jwmills@iuvax.cs.indiana.edu (Jonathan Mills)
Newsgroups: comp.arch
Subject: Re: "General-purpose" architectures and symbolic languages
Message-ID: <17833@iuvax.cs.indiana.edu>
Date: 22 Feb 89 14:37:23 GMT
References: <28113@ucbvax.BERKELEY.EDU>
Reply-To: jwmills@iuvax.cs.indiana.edu (Jonathan Mills)
Organization: Indiana University, Bloomington
Lines: 169

LIBRA: LOGICAL INFERENCE BALANCED RISC ARCHITECTURE (name by Tony Faustini)

The LIBRA is a 40-bit 4-stage pipelined processor with four major 
functional units, each pipelined & operating synchronously in parallel:

	1. Value ALU.  Contains separate hardware for arithmetic/logic, 	
			   dereferencing, bounds checking.
	2. Tag ALU.	   Tag comparison, interface to partial unifier.
	3. GC ALU.	   Support for mark&sweep GC algorithms
	4. PC ALU.	   Next instruction fetching;  many branches are simple
			   counter loads, but of a partial field.  Fastest 	
			   branches are within a page (yes this is a paged 	
			   machine, at least for now.  It doesn't bother the 	
			   compiler writer I work with, but I do recognize the 
			   problems for multi-processing.)

The machine is microcoded, but uses only one control word per machine 
instruction.  An alternate microcode address is latched after every 
instruction that sets condition codes (more on this in partial unify).

The 40-bit data word is divided into a 3-bit type, 1 bit each for mark & 
reverse garbage collect flags, and a 35-bit value which can be further 
typed for use with 32-bit numeric host processors.


NON-ORTHOGONAL REGISTER FILE

There are 32 registers.  4 are stack pointers (see why in a minute), 24 are
general purpose, and 4 are link registers for subroutine calling (or 
continuation pointers a la the WAM).  The register bank is non-orthogonal 
to allow single cycle instructions that can:

	push register A using register B as a pointer to memory,
	increment/decrement register B,
	store register B into register C, overwriting the tag in C with an
		immediate.

These instructions make use of parts of the datapath that would be idle 
that are provided for forwarding, and routes their contents back to the 
register file, which is useful in a structure-copying implementation of 
Prolog.


PARTIAL UNIFY

Microcode addressing is typically provided or modified by flags, fields in 
the microinstruction, and a microprogram counter.  In the LIBRA 
concatenated-tags are used for microcode addressing.  Using this technique, 
the microcode is viewed as a two-dimensional array of control words.  The 
two tags saved from the previous instruction's operands are used to index 
the microcode array, and select the control word for a replaceable 
instruction.  Single-cycle partial unify performs either a nop, a store, a 
call or a branch, thus condensing three to five tag checking instructions 
into one.  The partial unify instruction can eliminate as many as 30% of 
the subroutine calls performed by a general purpose RISC running Prolog.

By applying this operation to the LIBRA's partial unification and switch 
instructions, dereferencing can be made a conditional operation, performed 
only when necessary.  More significantly, the pair of checks for a bound 
variable (one is needed for each operand being unified) can be included in 
the operation of the partial unification instruction by simply enlarging 
the unifier table in ROM, thus removing as many as eight instructions 
(which include two branches) from the direct execution sequence.  The 
modified partial unify instruction thus reduces the number of instruction 
cycles spent dereferencing objects reached in eight or fewer indirections;  
this optimization improves the performance of unification from 60% to 100% 
by conditionally omitting dereferencing as well as explicit checks for 
bound variables. Here's an example:


Label	Condition	Instr	sc	Operands				Comment
--------------------------------------------------------------------
loop:	if bound1	ld		Xn	0	Xn
	if bound2	ld		Ai	0	Ai
enter:		sub	sc	Xn	Ai	r27
			unify	sc	Xn	Ai	loop	exit	;no mode split
	if trail1	push+		TR	Xn
	if trail2	push+		TR	Ai
exit:


SMART CACHE

Tags can also be used to improve cache hit rates by preventing transient 
data, such as intermediate elements in a pointer chain, from entering a 
cache.  This implies that dereferencing should take place outside the 
cache, rather than at the chip (which was suggested in 1986 by Mats 
Carlson).  If a "dereferenced choice point" is built once for a procedure, 
then the bottleneck that this would otherwise form can be reduced.  Also, 
the only hardware interlock built into the LIBRA is used to stall the 
pipeline during dereferencing.  Moving dereferencing into the cache would 
clean up the design.

It is also possible to do a CDC-6600 like trick, and move trailing and
failing out of the CPU to a peripheral processor (also done by McCabe in 
the DLM).  If cache entries are monitored by the trail-fail processor, and 
reset if requested by the CPU, the failure of a previous goal can be 
overlapped with the execution of the subsequent goal, thus improving 
backtracking performance.


SYMBOLIC BOUNDS CHECKING & TRAIL-CHECK SCOREBOARDING

Automatic bounds check are provided for trail and current environment 
checking. Some bounds checks are "sticky", to allow detection of an "almost 
stack collision" condition.  Then garbage collection can be initiated at 
programmer defined points.  This saves stack collision checking at every 
call, and further reduces branching.

Scoreboarding is generalized to include marking pre-processed data as well 
as data waiting to be processed.  The trail check is moved to occur during 
a load or the final stage of a dereference.  If the check showed that an 
unbound variable is being loaded, and it needs to be trailed if it is 
instantiated later, the register loaded is marked by setting a trail-check 
flag in the scoreboard.


SYMBOLIC CONDITIONING OF EVERY INSTRUCTION

Conditional instruction execution is used to implement preferred branches 
and is also used with numeric conditions to control the execution of every 
instruction in the Acorn RISC Machine (Acorn Computers Limited 1986, 1987).  
The LIBRA architecture extends this concept by adding symbolic conditions, 
allowing operations which formerly needed several test-and-branch 
instructions to be coded instead as a sequence of instructions, all of 
which are conditionally nops.  This reduces the branch frequency of the 
LIBRA, and improves its ability to use interleaved memory.  Symbolic 
conditions include
certain variable types (bound1, bound2) and trail and environment checks 
(trail1, trail2, env1, env2), and collision checks ("sticky overflow").

Overall, the use of conditional instruction execution, memory interleaving, 
tag-controlled caching, instruction prefetching and loop-buffering 
(lookahead, look-behind) allow the LIBRA to execute short inner loops and 
shallow backtracking at speeds limited primarily by the cache memory cycle 
time rather than the hit rate.


A GENERAL PURPOSE ARCHITECTURE?

Well, the LIBRA was compared only for Prolog, but it was compared to the 
PLM, the SPUR, the 68K and the 88K, and outperformed them all.  Truly, 
Peter VanRoy's assertion that general purpose architectures should support 
symbolic processing is one I have been pushing for years.

The argument against it that has always been thrown back at me is, "We will 
have 50 MIPS machine in a year or two; the performance advantage of a 
special purpose processor is not worth it."

I disagree with that rebuttal, and consider Motorola's interest (which 
varies over time) to indicate that it might - just might - be possible to 
extract the essentials of the LIBRA and put them into place as an SFU in a 
88000.

Toward this end I have designed a 32-bit LIBRA with 16 instructions, that 
encodes the WAM in 2.16x fewer cycles than the PLM.  As an SFU this beast
would be capable of interpreting 32-bit data as tagged cells, would be able 
to take advantage of all of the 88K numeric and O/S specific support, and 
thus could deal with symbolic processing variants such as Constraint Logic 
Programming, as well as Prolog (and presumably Lisp - and after talking to 
the Schemers here at IU I can find nothing they do that the LIBRA can't 
encode).

CLOSING WORD:  I have probably left out some minor but interesting things, 
but if you are really interested & are free early this summer I will give a 
5-week seminar beginning around May 20th at IU on symbolic architectures, 
with an emphasis on their use in automated deduction systems.  E-mail me 
for the details.