Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!jwmills From: jwmills@iuvax.cs.indiana.edu (Jonathan Mills) Newsgroups: comp.arch Subject: Re: "General-purpose" architectures and symbolic languages Message-ID: <17833@iuvax.cs.indiana.edu> Date: 22 Feb 89 14:37:23 GMT References: <28113@ucbvax.BERKELEY.EDU> Reply-To: jwmills@iuvax.cs.indiana.edu (Jonathan Mills) Organization: Indiana University, Bloomington Lines: 169 LIBRA: LOGICAL INFERENCE BALANCED RISC ARCHITECTURE (name by Tony Faustini) The LIBRA is a 40-bit 4-stage pipelined processor with four major functional units, each pipelined & operating synchronously in parallel: 1. Value ALU. Contains separate hardware for arithmetic/logic, dereferencing, bounds checking. 2. Tag ALU. Tag comparison, interface to partial unifier. 3. GC ALU. Support for mark&sweep GC algorithms 4. PC ALU. Next instruction fetching; many branches are simple counter loads, but of a partial field. Fastest branches are within a page (yes this is a paged machine, at least for now. It doesn't bother the compiler writer I work with, but I do recognize the problems for multi-processing.) The machine is microcoded, but uses only one control word per machine instruction. An alternate microcode address is latched after every instruction that sets condition codes (more on this in partial unify). The 40-bit data word is divided into a 3-bit type, 1 bit each for mark & reverse garbage collect flags, and a 35-bit value which can be further typed for use with 32-bit numeric host processors. NON-ORTHOGONAL REGISTER FILE There are 32 registers. 4 are stack pointers (see why in a minute), 24 are general purpose, and 4 are link registers for subroutine calling (or continuation pointers a la the WAM). The register bank is non-orthogonal to allow single cycle instructions that can: push register A using register B as a pointer to memory, increment/decrement register B, store register B into register C, overwriting the tag in C with an immediate. These instructions make use of parts of the datapath that would be idle that are provided for forwarding, and routes their contents back to the register file, which is useful in a structure-copying implementation of Prolog. PARTIAL UNIFY Microcode addressing is typically provided or modified by flags, fields in the microinstruction, and a microprogram counter. In the LIBRA concatenated-tags are used for microcode addressing. Using this technique, the microcode is viewed as a two-dimensional array of control words. The two tags saved from the previous instruction's operands are used to index the microcode array, and select the control word for a replaceable instruction. Single-cycle partial unify performs either a nop, a store, a call or a branch, thus condensing three to five tag checking instructions into one. The partial unify instruction can eliminate as many as 30% of the subroutine calls performed by a general purpose RISC running Prolog. By applying this operation to the LIBRA's partial unification and switch instructions, dereferencing can be made a conditional operation, performed only when necessary. More significantly, the pair of checks for a bound variable (one is needed for each operand being unified) can be included in the operation of the partial unification instruction by simply enlarging the unifier table in ROM, thus removing as many as eight instructions (which include two branches) from the direct execution sequence. The modified partial unify instruction thus reduces the number of instruction cycles spent dereferencing objects reached in eight or fewer indirections; this optimization improves the performance of unification from 60% to 100% by conditionally omitting dereferencing as well as explicit checks for bound variables. Here's an example: Label Condition Instr sc Operands Comment -------------------------------------------------------------------- loop: if bound1 ld Xn 0 Xn if bound2 ld Ai 0 Ai enter: sub sc Xn Ai r27 unify sc Xn Ai loop exit ;no mode split if trail1 push+ TR Xn if trail2 push+ TR Ai exit: SMART CACHE Tags can also be used to improve cache hit rates by preventing transient data, such as intermediate elements in a pointer chain, from entering a cache. This implies that dereferencing should take place outside the cache, rather than at the chip (which was suggested in 1986 by Mats Carlson). If a "dereferenced choice point" is built once for a procedure, then the bottleneck that this would otherwise form can be reduced. Also, the only hardware interlock built into the LIBRA is used to stall the pipeline during dereferencing. Moving dereferencing into the cache would clean up the design. It is also possible to do a CDC-6600 like trick, and move trailing and failing out of the CPU to a peripheral processor (also done by McCabe in the DLM). If cache entries are monitored by the trail-fail processor, and reset if requested by the CPU, the failure of a previous goal can be overlapped with the execution of the subsequent goal, thus improving backtracking performance. SYMBOLIC BOUNDS CHECKING & TRAIL-CHECK SCOREBOARDING Automatic bounds check are provided for trail and current environment checking. Some bounds checks are "sticky", to allow detection of an "almost stack collision" condition. Then garbage collection can be initiated at programmer defined points. This saves stack collision checking at every call, and further reduces branching. Scoreboarding is generalized to include marking pre-processed data as well as data waiting to be processed. The trail check is moved to occur during a load or the final stage of a dereference. If the check showed that an unbound variable is being loaded, and it needs to be trailed if it is instantiated later, the register loaded is marked by setting a trail-check flag in the scoreboard. SYMBOLIC CONDITIONING OF EVERY INSTRUCTION Conditional instruction execution is used to implement preferred branches and is also used with numeric conditions to control the execution of every instruction in the Acorn RISC Machine (Acorn Computers Limited 1986, 1987). The LIBRA architecture extends this concept by adding symbolic conditions, allowing operations which formerly needed several test-and-branch instructions to be coded instead as a sequence of instructions, all of which are conditionally nops. This reduces the branch frequency of the LIBRA, and improves its ability to use interleaved memory. Symbolic conditions include certain variable types (bound1, bound2) and trail and environment checks (trail1, trail2, env1, env2), and collision checks ("sticky overflow"). Overall, the use of conditional instruction execution, memory interleaving, tag-controlled caching, instruction prefetching and loop-buffering (lookahead, look-behind) allow the LIBRA to execute short inner loops and shallow backtracking at speeds limited primarily by the cache memory cycle time rather than the hit rate. A GENERAL PURPOSE ARCHITECTURE? Well, the LIBRA was compared only for Prolog, but it was compared to the PLM, the SPUR, the 68K and the 88K, and outperformed them all. Truly, Peter VanRoy's assertion that general purpose architectures should support symbolic processing is one I have been pushing for years. The argument against it that has always been thrown back at me is, "We will have 50 MIPS machine in a year or two; the performance advantage of a special purpose processor is not worth it." I disagree with that rebuttal, and consider Motorola's interest (which varies over time) to indicate that it might - just might - be possible to extract the essentials of the LIBRA and put them into place as an SFU in a 88000. Toward this end I have designed a 32-bit LIBRA with 16 instructions, that encodes the WAM in 2.16x fewer cycles than the PLM. As an SFU this beast would be capable of interpreting 32-bit data as tagged cells, would be able to take advantage of all of the 88K numeric and O/S specific support, and thus could deal with symbolic processing variants such as Constraint Logic Programming, as well as Prolog (and presumably Lisp - and after talking to the Schemers here at IU I can find nothing they do that the LIBRA can't encode). CLOSING WORD: I have probably left out some minor but interesting things, but if you are really interested & are free early this summer I will give a 5-week seminar beginning around May 20th at IU on symbolic architectures, with an emphasis on their use in automated deduction systems. E-mail me for the details.