Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!mash From: mash@mips.com (John Mashey) Newsgroups: comp.arch Subject: Re: RISC vs CISC (very long) Message-ID: <2419@spim.mips.COM> Date: 18 Apr 91 04:40:41 GMT References: <1991Apr03.232400.1560@kithrup.COM> <1991Apr7.064855.25469@zoo.toronto.edu> <537@appserv.Eng.Sun.COM> Sender: news@mips.COM Organization: MIPS Computer Systems, Inc. Lines: 395 Nntp-Posting-Host: winchester.mips.com WARNING: you may want to print this one to read it... In article <537@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes: >In article <1991Apr7.064855.25469@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: >>Or for a *really* straight comparison, due to John Mashey I think, compare >>the i860 to the i486: same tools, same process, same chip size, roughly >>the same release time... and the RISC machine is faster, much faster, in >>every way. > >I respect Henry, despite his annoying signatures, but I can't agree >with this comparison. The i860 is, to the best of my knowledge, a >clean start. The i486 is carrying baggage from all the way back to >8080's. (I personally think the i486 is a cool chip, if you look >closely, it is quite RISC like in the most common instruction uses. >First the 386, then the 486. Hmm. If Intel keeps this up, they might >make a decent CPU one day :-) > >Anyway, it is not a fair comparison. Not by a long stretch. Let's see >how the Nth generation SPARC, MIPS, and 88K's do (assuming they last) >compared to some new design from scratch. Well, there is baggage and there is BAGGAGE. One must be careful to distinguish between ARCHITECTURE and IMPLEMENTATION: a) Architectures persist longer than implementations, especially user-level Instruction-Set Architecture. b) The first member of an architecture family is usually designed with the current implementation constraints in mind, and if you're lucky, software people had some input. c) If you're really lucky, you anticipate 5-10 years of technology trends, and that modifies your idea of the ISA you commit to. d) It's pretty hard to delete anything from an ISA, except where: 1) You can find that NO ONE uses a feature (the 68020->68030 deletions mentioned by someone else). 2) You believe that you can trap and emulate the feature "fast enough". i.e., microVAX support for decimal ops, 68040 support for transcendentals. Now, one might claim that the i486 and 68040 are RISC implementations of CISC architectures .... and I think there is some truth to this, but I also think that it can confuse things badly: Anyone who has studied the history of computer design knows that high-performance designs have used many of the same techniques for years, for all of the natural reasons, that is: a) they use as much pipelining as they can, in soem cases, if this means a high gate-count, then so be it. b) They use caches (separate I & D if convenient). c) They use hardware rather than micro-code for the simpler operations. (For instance, look at the evolution of the S/360 products. Recall that the 360/85 used caches, back around 1969, and within a few years, so did any mainframe or supermini.) So, what difference is there among machines if similar implementation ideas are used? A: there is a very specific set of characteristics shared by most machines labeled RISCs, most of which are not shared by most CISCs. The RISC characteristics: a) Are aimed at more performance from current compiler technology (i.e., enough registers). OR b) Are aimed at fast pipelining in a virtual-memory environment with the ability to still survive exceptions without inextricably increasing the number of gate delays (notice that I say gate delays, NOT just how many gates). Even though various RISCs have made various decisions, most of them have been very careful to omit those things that CPU designers have found difficult and/or expensive to implement, and especially, things that are painful, for relatively little gain. I would claim, that even as RISCs evolve, they may have certain baggage that they'd wish weren't there .... but not very much. In particular, there are a bunch of objective characteristics shared by RISC ARCHITECTURES that clearly distinguish them from CISC architectures. I'll give a few examples, followed by the detailed analysis from an upcoming talk: MOST RISCs: 3a) Have 1 size of instruction in an instruction stream 3b) And that size is 4 bytes 3c) Have a handful (1-4) addressing modes) (* it is VERY hard to count these things; will discuss later). 3d) Have NO indirect addressing in any form (i.e., where you need one memory access to get the address of another operand in memory) 4a) Have NO operations that combine load/store with arithmetic, i.e., like add from memory, or add to memory. 4b) Have no more than 1 memory-addressed operand per instruction 5a) Do NOT support arbitrary alignment of data for loads/stores 5b) Use an MMU for a data address no more than once per instruction 6a) Have >=5 bits per integer register specifier 6b) Have >= 4 bits per FP register specifier These rules provide a rather distinct dividing line among architectures, and I think there are rather strong technical reasons for this, such that there is one more interesting attribute: almost every architecture whose first instance appeared on the market from 1986 onward obeys the rules above ..... Note that I didn't say anything about counting the number of instructions.... So, here's a table: C: number of years since first implementation sold in this family (or first thing of which this is binary compatible with) 3a: # instruction sizes 3b: maximum instruction size in bytes 3c: number of distinct addressing modes for accessing data (not jumps)> I didn't count register or literal, but only ones that referenced memory, and I counted different formats with different offset sizes separately. This was hard work... Also, even when a machine had different modes for register-relative and PC_relative addressing, I counted them only once. 3d: indirect addressing: 0: no, 1: yes 4a: load/store combined with arithmetic: 0: no, 1:yes 4b: maximum number of memory operands 5a: unaligned addressing of memory references allowed in load/store, without specific instructions 0: no never (MIPS, SPARC, etc) 1: sometimes (as in RS/6000) 2: just about any time 5b: maximum number of MMU uses for data operands in an instruction 6a: number of bits for integer register specifier 6b: number of bits for 64-bit or more FP register specifier, distinct from integer registers Note that all of this are ARCHIECTURE issues, and it is usually quite difficult to either delete a feature (3a-5b) or increase the number of real registers (6a-6b) given an initial isntruction set design. (yes, register renaming can help, but...) Now: items 3a, 3b, and 3c are an indication of the decode complexity 3d-5b hint at the ease or difficulty of pipelining, especially in the presence of virtual-memory requirements, and need to go fast while still taking exceptions sanely items 6a and 6b are more related to ability to take good advantage of current compilers. There are some other attributes that can be useful, but I couldn't imagine how to create metrics for them without being very subjective; for example "degree of sequential decode", "number of writebacks that you might want to do in the middle of an instruction, but can't, because you have to wait to make sure you see all of the instruction before committing any state, because the last part might cause a page fault," or "irregularity/assymetricness of register use", or "irregularity/complexity of instruction formats". I'd love to use those, but just don't know how to measure them. Also, I'd be happy to hear corrections for some of these. So, here's a table of 12 implementations of various architectures, one per architecture, with the attributes above. Just for fun, I'm going to leave the architectures coded at first, although I'll identify them later. I'm going to draw a line between H1 and L4 (obviously, the RISC-CISC Line), and also, at the head of each column, I'm going to put a rule, which, in that column, most of the RISCs obey. Any RISC that does not obey it is marked with a +; any CISC that DOES obey it is marked with a *. So... CPU Age 3a 3b 3c 3d 4a 4b 5a 5b 6a 6b # ODD RULE <6 =1 =4 <5 =0 =0 =1 <2 =1 >4 >3 ------------------------------------------------------------------------- A1 4 1 4 1 0 0 1 0 1 8 3+ 1 B1 5 1 4 1 0 0 1 0 1 5 4 - C1 2 1 4 2 0 0 1 0 1 5 4 - D1 2 1 4 3 0 0 1 0 1 5 0+ 1 E1 5 1 4 10+ 0 0 1 0 1 5 4 1 F1 5 2+ 4 1 0 0 1 0 1 4+ 3+ 3 G1 1 1 4 4 0 0 1 1 1 5 5 - H1 2 1 4 4 0 0 1 0 1 5 4 - RISC --------------------------------------------------------------- L4 26 4 8 2* 0* 1 2 2 4 4 2 2 CISC M2 12 12 12 15 0* 1 2 2 4 3 3 1 N1 10 21 21 23 1 1 2 2 4 3 3 - O3 11 11 22 44 1 1 2 2 8 4 3 - P3 13 56 56 22 1 1 6 2 24 4 0 - An interesting exercise is to analyze the ODD cases. First, observe that of 12 architectures, in only 2 cases does an architecture have an attribute that puts it on the wrong side of the line. of the RISCs: -A1 is slightly unusual in having more integer registers, and less FP than usual. -D1 is unusual in sharing integer and FP registers (that's what the D1:6b == 0). -E1 seems odd in having a large number of address modes. I think most of this is an artifact of the way that I counted, as this architecture really only has a fundamentally small number of ways to create addresses, but has several different-sized offsets and combinations, but all within 1 4-byte instruction; I believe that it's addressing mechanisms are fundamentally MUCH simpler than, for example, M2, or especially N1, O3, or P3, but the specific number doesn't capture it very well. -F1 .... is not sold any more. -H1 one might argue that this process has 2 sizes of instructions, but I'd observe that at any point in the instruction stream, the instructions are either 4-bytes long, or 8-bytes long, with the setting done by a mode bit, i.e., not dynamically encoded in every instruction. Of the processors called CISCs: -L4 happens to be one in which you can tell the length of the instruction from the first few bits, has a fairly regular instruction decode, has relatively few addressing modes, no indirect addressing. In fact, a big subset of its instructions are actually fairly RISC-like, although another subset is very CISCy. -M2 has a myriad of instruction formats, but fortunately avoided indirect addressing, and actually, MOST of instructions only have 1 address, except for a small set of string operations with 2. I.e., in this case, the decode complexity may be high, but most instructions cannot turn into multiple-memory-address-with-side-effects things. -N1,O3, and P3 are actually fairly clean, orthogonal architectures, in which most operations can consistently have operands in either memory or registers, and there are relatively few weirdnesses of special-cased uses of registers. Unfortunately, they also have indirect addressing, instruction formats whose very orthogonality almost guarantees sequential decoding, where it's hard to even know how long an instruction is unitl you parse each piece, and that may have side-effects where you'd like to do a register write-back early, but either: must wait until you see all of the instruction until you commit state or must have "undo" shadow-registers or must use instruction-continuation with fairly tricky exception handling to restore the state of the machine It is also interesting to note that the original member of the family to which O3 belongs was rather simpler in some of the critical areas, with only 5 instruction sizes, of maximum size 10 bytes, and no indirect addressing, and requiring alignment (i.e., it was a much more RISC-like design, and it would be a fascinating speculation to know if that extra complexity was useful in practice). Now, here's the table again, with the labels: CPU Age 3a 3b 3c 3d 4a 4b 5a 5b 6a 6b # ODD RULE <6 =1 =4 <5 =0 =0 =1 <2 =1 >4 >3 ------------------------------------------------------------------------- A1 4 1 4 1 0 0 1 0 1 8 3+ 1 AMD 29K B1 5 1 4 1 0 0 1 0 1 5 4 - R2000 C1 2 1 4 2 0 0 1 0 1 5 4 - SPARC D1 2 1 4 3 0 0 1 0 1 5 0+ 1 MC88000 E1 5 1 4 10+ 0 0 1 0 1 5 4 1 HP PA F1 5 2+ 4 1 0 0 1 0 1 4+ 3+ 3 IBM RT/PC G1 1 1 4 4 0 0 1 1 1 5 5 - IBM RS/6000 H1 2 1 4 4 0 0 1 0 1 5 4 - Intel i860 --------------------------------------------------------------- L4 26 4 8 2* 0* 1 2 2 4 4 2 2 IBM 3090 M2 12 12 12 15 0* 1 2 2 4 3 3 1 Intel i486 N1 10 21 21 23 1 1 2 2 4 3 3 - NSC 32016 O3 11 11 22 44 1 1 2 2 8 4 3 - MC 68040 P3 13 56 56 22 1 1 6 2 24 4 0 - VAX General comment: this may sound weird, but in the long term, it might be easier to deal with a really complicated bunch of instruction formats, than with a complex set of addressing modes, because at least the former is more amenable to pre-decoding into a cache of decoded instructions that can be pipelined reasonably, whereas the pipeline on the latter can get very tricky (examples to follow). This can lead to the funny effect that a relatively "clean", orthogonal archiecture may actually be harder to make run fast than one that is less clean. Obviously, every weirdness has it's penalties.... But consider the fundamental difficulty of pipelining something like (on a VAX): ADDL @(R1)+,@(R1)+,@(R2)+ (I.e., something that, might theoretically arise from: int **r1, **r2; **r2++ = **r1++ + **r1++; and which a RISC machine would do (most straightforwardly) as: lw r3,0(r1) *r1 add r1,4 r1++ lw r4,0(r1) *r1 again add r1,4 r1++ lw r5,0(r2) r5 = *r2 add r6,r3,r4 sum in r6 sw r6,0(r5) **r2 = sum add r5,4 r2++ (Now, some RISCs might use auto-increment to get rid of, for example, the last add; in any case, samrt compilers are quite likely to generate something more like: lw r3,0(r1) *r1 lw r4,4(r1) *r1 again add r1,8 r1++ lw r5,0(r2) r5 = *r2 add r6,r3,r4 sum in r6 sw r6,0(r5) **r2 = sum add r5,4 r2++ which has no stalls anywhere on most RISCs.) Now, consider what the VAX has to do: 1) Decode the opcode (ADD) 2) Fetch first operand specifier from I-stream and work on it. a) Compute the memory address from (r1) If aligned run through MMU if MMU miss, fixup access cache if cache miss, do write-back/refill Elseif unaligned run through MMU for first part of data if MMU miss, fixup access cache for that part of data if cache miss, do write-back/refill run through MMU for second part of data if MMU miss, fixup access cache for second part of data if cache miss, do write-back/refill Now, in either case, we now have a longword that has the address of the actual data. b) Increment r1 [well, this is where you'd LIKE to do it, or in parallel with step 2a).] However, see later why not... c) Now, fetch the actual data from memory, using the address just obtained, doing everything in step 2a) again, yielding the actual data, which we needto stick in a temporary buffer, since it doesn't actually go in a register. 3) Now, decode the second operand specifier, which goes thru everything that we did in step 2, only again, and leaves the results in a second temporary buffer. Note that we'd like to be starting this before we get done with all of 2 (and I THINK the VAX9000 probably does that??) but you have to be careful to bypass/interlock on potential side-effects to registers .... actually, you may well have to keep shadow copies of every register that might get written in the instruction, since every operand can use auto-increment/decrement. You'd probably want badly to try to compute the address of the second argument and do the MMU access interleaved with the memory access of the first, although the ability of any operand to need 2-4 MMU accesses probably makes this tricky. [Recall that any MMU access may well cause a page fault....] 4) Now, do the add. [could cause exception] 5) Now, do the third specifier .... only, it might be a little different, depending on the nature of the cache, that is, you cannot modify cache or memory, unless you know it will complete. (Why? well, suppose that the location you are storing into overlaps with one of the indirect-addressing words pointed to by r1 or 4(r1), and suppose that the store was unaligned, and suppose that the last byte of the store crossed a page boundary and caused a page fault, and that you'd already written the first 3 bytes. If you did this straightforwardly, and then tried to restart the instruction, it wouldn't do the same thing the second time. 6) When you're sure all is well, and the store is on its way, then you can safely update the two registers, but you'd better wait until the end, or else, keep copies of any modified registers until you're sure it's safe. (I think both have been done ??) 7) You may say that this code is unlikely, but it is legal, so the CPU must do it. This style has the following effects: a) You have to worry about unlikely cases. b) You'd like to do the work, with predictable uses of functional units, but instead, they can make unpredictable demands. c) You'd like to minimize the amount of buffering and state, but it costs you in both to go fast. d) Simple pipelining is very, very tough: for example, it is pretty hard to do much about the next instruction following the ADDL, (except some early decode, perhaps), without a lot of gates for special-casing. (I've always been amazed that CVAX chips are fast as they are, and VAX 9000s are REALLY impressive...) e) EVERY memory operand can potentially cause 4 MMU uses, and hence 4 MMU faults that might actually be page faults... 8) Consider how "lazy" RISC designers can be, with the RISC sequence shown: a) Every load/store uses exactly 1 MMU access. b) The compilers are often free to re-arrange the order, even across what would have been the next instruction on a CISC. This gets rid of some stalls that the CISC may be stuck with (especially memory accesses). c) The alignment requirement avoids especially the problem with sending the first part of a store on the way before you're SURE that the second part of it is safe to do. Finally, to be fair, let me add the two cases that I knew of that were more on the borderline: i960 and Clipper: CPU Age 3a 3b 3c 3d 4a 4b 5a 5b 6a 6b # ODD RULE <6 =1 =4 <5 =0 =0 =1 <2 =1 >4 >3 ------------------------------------------------------------------------- J1 5 4+ 8+ 9+ 0 0 1 0 2 4+ 3+ 5 Clipper K1 3 2+ 8+ 9+ 0 0 1 2+ - 5 3+ 5 Intel 960KB SUMMARY: 1) RISCs share certain architectural characteristics, although there are differences, and some of those differences matter a lot. 2) However, the RISCs, as a group, are much more alike than the CISCs as a group. 3) At least some of these architectural characteristics have fairly serious consequences on the pipelinability of the ISA, especially in a virtual-memory, cached environment. 4) Counting instructions turns out to be fairly irrelevant: a) It's HARD to actually count instructions in a meaningful way... (if you disagree, I'll claim that the VAX is RISCier than any RISC, at least for part of its instruction set :-) b) More instructions aren't what REALLY hurts you, anywhere near as much features that are hard to pipeline: c) RISCs can perfectly well have string-support, or decimal arithmetic support, or graphics transforms ... or lots of strange register-register transforms, and it won't cause problems ..... but compare that with the consequence of adding a single instruction that has 2-3 memory operands, each of which can go indirect, with auto-increments, and unaligned data... -- -john mashey DISCLAIMER: UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086