Path: utzoo!mnetor!uunet!husc6!bbn!rochester!pt.cs.cmu.edu!sei!sei.cmu.edu!firth
From: firth@sei.cmu.edu (Robert Firth)
Newsgroups: comp.arch
Subject: The WM Machine
Message-ID: <5339@aw.sei.cmu.edu>
Date: 4 May 88 20:04:25 GMT
Sender: netnews@sei.cmu.edu
Lines: 285

This post is a fairly long review of the WM machine recently
described in Computer Architecture News.

It is about 280 lines long, and follows this form feed.  It
assumes (a) you have read the paper, and (b) care tuppence
about my thoughts.

The opinions, misunderstandings, errors &c are mine alone.

	The WM Computer Archicecture
	============================

	[Wm A Wulf, Computer Architecture News, 16 (1988) pp 70..84]

	Comments by Robert Firth, CMU-SEI, May 1988


Introduction
------------

The WM computer architecture, described in the cited paper, is a sketch
of part of the order code of a new computer processor.  This processor
is designed, it its author's words, to retain the RISC philosophy while
achieving significantly greater performance than other RISC designs.

This Note comments on the major features of the architecture as described,
and reflects on their possible effectiveness.


Major Features
--------------

The major features of the WM architecture are

(a) a load/store design

(b) loads and stores use implied FIFO operand and result queues

(c) operations use registers (or FIFO) for operands and results

(d) each instruction performs TWO operations on THREE operadns,
    after the pattern

	result := rand1 rator1 (rand2 rator2 rand3)

(e) relational operators return their right operand as result and
    load a condition into a FIFO; conditional jumps read the condition FIFO

There are also facilities for a floating point coprocessor, a "stream
mode" load or store, and multiple load FIFOs.

As shown, the design allows for 32 named integer registers (of which one
or two may in fact be FIFOs), 32 floating registers (similarly), 16
general integer operations, 16 general floating operations, and 32 "special"
operations.  The paper does not describe any of these operations, but by
implication they include arithmetic, logical, and relational.


General Observations
--------------------

The first observation is that the paper presents only a sketch.  There is
no detail of any of the operations, no detail on the special instructions
other than a brief account of the "stream" loads, and no description of
even so important an operation as the subroutine call.

Moreover, while the claims for the machine are made in terms of execution
speed, data access speed, and so on, the presentation is purely at the
level of the instruction set architecture; there are no timing diagrams,
data flow diagrams, or other lower-level detail.

This incompleteness makes it hard to assess the design; nevertheless, the
following sections offer some critical commentary, on specific and more
general features.

The next observation is clear from Figure 1 of the reference: the instruction
set space is almost saturated.  As Bell and Newell remark, this is a mistake
in the design of a new machine, since it leaves no room for extensions or
afterthoughts.  In detail: the instruction class is indicated by the
leading four bits, and all combinations are used.  In all classes except
2#1110# (special), the remaining bit patterns are almost fully allocated.
The special class admits 32 operations; the text implies that a large
number are already allocated.


Load/Store
----------

A load/store design serves two purposes

(a) it simplifies the instruction set by allowing general addresses to
    appear in only a few places. In particular, it usually allows the
    instruction encoding to use a single uniform size.

(b) it improves performance by uncoupling the fetch of a value from its
    use, thereby allowing the fetch delay to be overlapped with useful
    work.

Previous RISC experience shows that these purposes are well served by this
design, and there seems little reason to doubt the WM design will work as
well.


FIFOs
-----

The use of FIFO queues, however, is more problematical.  There is some
gain in instruction density, since the destination of a load, or the
source of a store, need not be specified.  But this is a very small gain,
since most of a load or store instruction is taken up with the memory
address.

I do not believe there is any performance gain.  If loads or stores can
be overlapped, it is relatively easy for a compiler to target successive
loads into different registers.  With 32 general registers available,
there will surely be enough for any reasonable sequence of loads or
stores.  The one exception - of which there is an example in the paper -
is when the loads are in a tight loop and the load delay runs round the
loop.  But there the rate-determining step is the load, regardless of
how it is buffered, and loop unrolling can almost always avoid load
destination clashes.

Moreover, loads into explicit registers offer several advantages.  First,
the operands need not be used in the order in which they are loaded,
which is often helpful, especially if operations have to be reordered to
take advantage of parallel ALUs.  Secondly, and far more important, the
use of an operand does not destroy it.  It is significant that none of
the examples given in the paper ever uses the same operand twice; such
an example would show clearly how a load FIFO forces an extra instruction
to be generated.

Finally, FIFOs cause severe implementation problems, one of which at least
- efficient saving and restoring of FIFO contents on a task context switch -
has never to my knowledge been satisfactorily solved.  Other serious
problems include resetting the FIFO after an interrupt or exception, and
resuming an orderly flow of loads and stores after a memory-management trap.

In sum, I believe the use of FIFOs offers no advantages over loads and
stores into and out of the general registers, and incurs several
disadvantages.


Operations use Registers
------------------------

This is another piece of the normal RISC approach, and is again probably
a good feature.  However, even a RISC machine must provide operations
directly on memory, for processor synchronization; the absence of any
such instructions in WM might be thought a major oversight.  But it is 
clear from the paper as a whole, that the author is describing really
only that part of the machine that a compiler writer would see.


Two Operations per Instruction
------------------------------

This feature seems to me quite unwarranted.  One major point of a RISC
design is that instructions are simple enough to be executed quickly.
But here is a machine where, within one instruction, there are two
operations, the one gated on the result of the other.  This forces the
instruction to take twice as long, regardless of how many execution
units might be available.

A design in which twice as many instructions are executed, but successive
instructions are independent, clearly cannot perform worse, and will perform
better if several execution units can be run in parallel.  One main lesson
of prior designs - achieve higher parallelism by uncoupling mutually dependent
operations - has simply been ignored.

The WM sketch intends the first half of each instruction to be overlapped with
the second half of the preceding instruction, and there is a dependency
rule about that.  But the same apparatus is all that is needed to execute
independent instructions in parallel, so the implementation complexity of
WM is no less, but its performance is impeded by the extra dependency
within each instruction.


Relational Operations and Conditions
------------------------------------

It is a matter of debate whether condition codes are justified, rather
than having the comparison yield a Boolean result or even drive the
conditional branch directly.  Proponents of RISC would probably tend
to regard condition codes as a needless complication.  If that is so,
then, a fortiori, a condition-code FIFO is a multiply needless complication.

In many years of practical programming, I can recall only a couple of
cases where it would have been useful to enqueue two sets of condition
codes for successive testing.  Accordingly, I must pronounce this feature
useless.  Moreover, there are several RISC machines where a comparison,
test, and jump decision can be done in one cycle.  Therefore, no purpose
is served by separating the comparison from the jump decision, since
there is no latency to be overlapped.  And the branch-delay optimisation
issue is equally present, under a different guise, in the WM machine, as
its author admits.

The presence of yet another FIFO adds yet more complexity to the handling
of traps, interrupts, or context switches.


Address Modes
-------------

As a final point of detail, let us look not at a feature, but at a non
feature.  The WM machine has no address modes.  Instead, the load and
store instructions perform an integer computation, again of the form

	rand1 rator1 (rand2 rator2 rand3)

and the result becomes the effective address for the memory reference.

Consider first, how one might access a simple static variable, at a given
(32-bit absolute) address.  Clearly, one cannot say the equivalent of

	Load @16#abcdefgh#

since the machine does not permit an absolute address as an operand.

Another way, would be to store the address in a literal pool, and access
the operand indirectly, by a PC-relative or register-based mode.  This
is the strategy on the CA LSI-2, for instance.  But WM has no indirection,
and no PC-relative mode.

Alternatively, one could load the address into a register, using a load
literal instruction (as is done on the M/500).  But there is no such
instruction.

As far as can be discovered, an address can be copmposed only of one,
two, or three registers, and one or two 5-bit literals.  This means it
is almost impossible, in one instruction, to construct the address of
anything efficiently.  To generate a 32-bit static address, by my
count, takes three instructions.  To generate a 16-bit unsigned
displacement takes two instructions, and a third is then required to
access the based variable so addressed.

As another example, consider accessing a local variable.  If it is
within 32 bytes of the frame pointer, we are lucky:

	(fp) + (displacement noop noreg)

will serve.  If it is, say, at local address 40, then we might say

	(fp) + (8 + (rx))

where rx holds the useful constant 32, and at the price of an extra
addition (and surely an extra cycle) we can now encode a six-bit
displacement.  (Actually, it would be marginally better to reverse the
registers).  But how are we to load 32 into rx?  Perhaps by

	rx := 16 + (16 noop noreg)

To summarise: the addressing capability of the machine is desperately
inadequate.  It might, as its author says, allow "scaled-index" modes
to be generated, but one pays a high price for this fairly negligible
capability. It would have been far better to use the bits of this
instruction mode to encode, for instance, base register and signed
16-bit displacement.  In fact, they could have encoded two base registers
and such a displacement.


Conclusion
----------

The WM machine contains some standard RISC features and some novel features.

Almost all the novel features are both complicated and bad; the claims that
they will enhance performance are, in this author's opinion, unfounded:

. FIFO load and store queues give inferior performance, at far higher
  cost, than loads and stores that use general registers

. executing two coupled operations per instruction creates needless
  dependency, and so reduces performance

. a FIFO condition code queue serves no purpose and solves no problem

. the machine is crippled by inadequate addressing capability

Finally, any implementation of this design will be faced with serious
problems in building correct synchronous traps, efficient task context
switches, and transparent memory management.


--------