Path: utzoo!mnetor!uunet!husc6!bbn!rochester!pt.cs.cmu.edu!sei!sei.cmu.edu!firth From: firth@sei.cmu.edu (Robert Firth) Newsgroups: comp.arch Subject: The WM Machine Message-ID: <5339@aw.sei.cmu.edu> Date: 4 May 88 20:04:25 GMT Sender: netnews@sei.cmu.edu Lines: 285 This post is a fairly long review of the WM machine recently described in Computer Architecture News. It is about 280 lines long, and follows this form feed. It assumes (a) you have read the paper, and (b) care tuppence about my thoughts. The opinions, misunderstandings, errors &c are mine alone. The WM Computer Archicecture ============================ [Wm A Wulf, Computer Architecture News, 16 (1988) pp 70..84] Comments by Robert Firth, CMU-SEI, May 1988 Introduction ------------ The WM computer architecture, described in the cited paper, is a sketch of part of the order code of a new computer processor. This processor is designed, it its author's words, to retain the RISC philosophy while achieving significantly greater performance than other RISC designs. This Note comments on the major features of the architecture as described, and reflects on their possible effectiveness. Major Features -------------- The major features of the WM architecture are (a) a load/store design (b) loads and stores use implied FIFO operand and result queues (c) operations use registers (or FIFO) for operands and results (d) each instruction performs TWO operations on THREE operadns, after the pattern result := rand1 rator1 (rand2 rator2 rand3) (e) relational operators return their right operand as result and load a condition into a FIFO; conditional jumps read the condition FIFO There are also facilities for a floating point coprocessor, a "stream mode" load or store, and multiple load FIFOs. As shown, the design allows for 32 named integer registers (of which one or two may in fact be FIFOs), 32 floating registers (similarly), 16 general integer operations, 16 general floating operations, and 32 "special" operations. The paper does not describe any of these operations, but by implication they include arithmetic, logical, and relational. General Observations -------------------- The first observation is that the paper presents only a sketch. There is no detail of any of the operations, no detail on the special instructions other than a brief account of the "stream" loads, and no description of even so important an operation as the subroutine call. Moreover, while the claims for the machine are made in terms of execution speed, data access speed, and so on, the presentation is purely at the level of the instruction set architecture; there are no timing diagrams, data flow diagrams, or other lower-level detail. This incompleteness makes it hard to assess the design; nevertheless, the following sections offer some critical commentary, on specific and more general features. The next observation is clear from Figure 1 of the reference: the instruction set space is almost saturated. As Bell and Newell remark, this is a mistake in the design of a new machine, since it leaves no room for extensions or afterthoughts. In detail: the instruction class is indicated by the leading four bits, and all combinations are used. In all classes except 2#1110# (special), the remaining bit patterns are almost fully allocated. The special class admits 32 operations; the text implies that a large number are already allocated. Load/Store ---------- A load/store design serves two purposes (a) it simplifies the instruction set by allowing general addresses to appear in only a few places. In particular, it usually allows the instruction encoding to use a single uniform size. (b) it improves performance by uncoupling the fetch of a value from its use, thereby allowing the fetch delay to be overlapped with useful work. Previous RISC experience shows that these purposes are well served by this design, and there seems little reason to doubt the WM design will work as well. FIFOs ----- The use of FIFO queues, however, is more problematical. There is some gain in instruction density, since the destination of a load, or the source of a store, need not be specified. But this is a very small gain, since most of a load or store instruction is taken up with the memory address. I do not believe there is any performance gain. If loads or stores can be overlapped, it is relatively easy for a compiler to target successive loads into different registers. With 32 general registers available, there will surely be enough for any reasonable sequence of loads or stores. The one exception - of which there is an example in the paper - is when the loads are in a tight loop and the load delay runs round the loop. But there the rate-determining step is the load, regardless of how it is buffered, and loop unrolling can almost always avoid load destination clashes. Moreover, loads into explicit registers offer several advantages. First, the operands need not be used in the order in which they are loaded, which is often helpful, especially if operations have to be reordered to take advantage of parallel ALUs. Secondly, and far more important, the use of an operand does not destroy it. It is significant that none of the examples given in the paper ever uses the same operand twice; such an example would show clearly how a load FIFO forces an extra instruction to be generated. Finally, FIFOs cause severe implementation problems, one of which at least - efficient saving and restoring of FIFO contents on a task context switch - has never to my knowledge been satisfactorily solved. Other serious problems include resetting the FIFO after an interrupt or exception, and resuming an orderly flow of loads and stores after a memory-management trap. In sum, I believe the use of FIFOs offers no advantages over loads and stores into and out of the general registers, and incurs several disadvantages. Operations use Registers ------------------------ This is another piece of the normal RISC approach, and is again probably a good feature. However, even a RISC machine must provide operations directly on memory, for processor synchronization; the absence of any such instructions in WM might be thought a major oversight. But it is clear from the paper as a whole, that the author is describing really only that part of the machine that a compiler writer would see. Two Operations per Instruction ------------------------------ This feature seems to me quite unwarranted. One major point of a RISC design is that instructions are simple enough to be executed quickly. But here is a machine where, within one instruction, there are two operations, the one gated on the result of the other. This forces the instruction to take twice as long, regardless of how many execution units might be available. A design in which twice as many instructions are executed, but successive instructions are independent, clearly cannot perform worse, and will perform better if several execution units can be run in parallel. One main lesson of prior designs - achieve higher parallelism by uncoupling mutually dependent operations - has simply been ignored. The WM sketch intends the first half of each instruction to be overlapped with the second half of the preceding instruction, and there is a dependency rule about that. But the same apparatus is all that is needed to execute independent instructions in parallel, so the implementation complexity of WM is no less, but its performance is impeded by the extra dependency within each instruction. Relational Operations and Conditions ------------------------------------ It is a matter of debate whether condition codes are justified, rather than having the comparison yield a Boolean result or even drive the conditional branch directly. Proponents of RISC would probably tend to regard condition codes as a needless complication. If that is so, then, a fortiori, a condition-code FIFO is a multiply needless complication. In many years of practical programming, I can recall only a couple of cases where it would have been useful to enqueue two sets of condition codes for successive testing. Accordingly, I must pronounce this feature useless. Moreover, there are several RISC machines where a comparison, test, and jump decision can be done in one cycle. Therefore, no purpose is served by separating the comparison from the jump decision, since there is no latency to be overlapped. And the branch-delay optimisation issue is equally present, under a different guise, in the WM machine, as its author admits. The presence of yet another FIFO adds yet more complexity to the handling of traps, interrupts, or context switches. Address Modes ------------- As a final point of detail, let us look not at a feature, but at a non feature. The WM machine has no address modes. Instead, the load and store instructions perform an integer computation, again of the form rand1 rator1 (rand2 rator2 rand3) and the result becomes the effective address for the memory reference. Consider first, how one might access a simple static variable, at a given (32-bit absolute) address. Clearly, one cannot say the equivalent of Load @16#abcdefgh# since the machine does not permit an absolute address as an operand. Another way, would be to store the address in a literal pool, and access the operand indirectly, by a PC-relative or register-based mode. This is the strategy on the CA LSI-2, for instance. But WM has no indirection, and no PC-relative mode. Alternatively, one could load the address into a register, using a load literal instruction (as is done on the M/500). But there is no such instruction. As far as can be discovered, an address can be copmposed only of one, two, or three registers, and one or two 5-bit literals. This means it is almost impossible, in one instruction, to construct the address of anything efficiently. To generate a 32-bit static address, by my count, takes three instructions. To generate a 16-bit unsigned displacement takes two instructions, and a third is then required to access the based variable so addressed. As another example, consider accessing a local variable. If it is within 32 bytes of the frame pointer, we are lucky: (fp) + (displacement noop noreg) will serve. If it is, say, at local address 40, then we might say (fp) + (8 + (rx)) where rx holds the useful constant 32, and at the price of an extra addition (and surely an extra cycle) we can now encode a six-bit displacement. (Actually, it would be marginally better to reverse the registers). But how are we to load 32 into rx? Perhaps by rx := 16 + (16 noop noreg) To summarise: the addressing capability of the machine is desperately inadequate. It might, as its author says, allow "scaled-index" modes to be generated, but one pays a high price for this fairly negligible capability. It would have been far better to use the bits of this instruction mode to encode, for instance, base register and signed 16-bit displacement. In fact, they could have encoded two base registers and such a displacement. Conclusion ---------- The WM machine contains some standard RISC features and some novel features. Almost all the novel features are both complicated and bad; the claims that they will enhance performance are, in this author's opinion, unfounded: . FIFO load and store queues give inferior performance, at far higher cost, than loads and stores that use general registers . executing two coupled operations per instruction creates needless dependency, and so reduces performance . a FIFO condition code queue serves no purpose and solves no problem . the machine is crippled by inadequate addressing capability Finally, any implementation of this design will be faced with serious problems in building correct synchronous traps, efficient task context switches, and transparent memory management. --------