Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!shadooby!samsung!gem.mps.ohio-state.edu!apple!baum
From: baum@Apple.COM (Allen J. Baum)
Newsgroups: comp.arch
Subject: Re: RISC vs CISC (rational discussion..) + new IBM 'beyond RISC'
Message-ID: <36340@apple.Apple.COM>
Date: 9 Nov 89 22:46:13 GMT
References: <503@ctycal.UUCP> <31031@winchester.mips.COM>
Reply-To: baum@apple.UUCP (Allen Baum)
Organization: Apple Computer, Inc.
Lines: 112

[]

>In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes:

>ARGUMENT 1: RISC is better because it's smaller, for new technologies
>When die size is a limit, RISC is better, because you can do it,

Agreed. However, current trends appear to indicate that new technologies
mature reasonably quickly. This argument works for just a few (<10?) years.

>ARGUMENT 2: RISC is better for cost reasons, because it's smaller

When you get to 1M transistors on a chip, the space for some extra decoding
logic is negligible. This argument works only during the initial
(e.g. 'new technology') phase.

>ARGUMENT 3: RISC is better, because even if there is enough space on a die
>   to put a whole CPU plus other things, the RISC can afford more space
>   for caches and other good things, and so it will be faster.

See the argument above. What is the performance difference between a 32K
cache and a 31K cache?

>ARGUMENT 4: RISC is better, because it's simpler, and hence there is faster
>   time to design and test the chips.
>   COMMENT: maybe, maybe not.

Agreed. Notice that as we get more and more transistors, we start to do more
complex things (superscalar, high-performance FP, graphics), not just add
more cache. Superscalar may double your performance. It would be literally
impossible to add enough cache to do that, so simple, regular hardware is
probably not where the transistors will go. (For disbelievers: if your cache
miss penalty * miss ratio is < 1, then even if every access hit, you'd save
less than a cycle, so CPI only approaches 1. Superscalar does better.)
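To make that arithmetic concrete, here is a back-of-the-envelope model (my
own illustrative numbers, not from anyone's paper): effective CPI is roughly
base CPI + miss ratio * miss penalty, so once miss ratio * penalty is well
under a cycle, more cache can't even buy back that cycle, while a 2-wide
superscalar machine can in principle push CPI toward 0.5.

    #include <stdio.h>

    /* Effective cycles per instruction for a simple scalar machine:
       base CPI plus the average stall contributed by cache misses.
       Illustrative model only; the numbers below are made up. */
    static double effective_cpi(double base_cpi, double miss_ratio,
                                double miss_penalty)
    {
        return base_cpi + miss_ratio * miss_penalty;
    }

    int main(void)
    {
        double penalty = 10.0;           /* cycles per miss (assumed)       */
        double small_cache_miss = 0.05;  /* 5% miss ratio (assumed)         */
        double big_cache_miss   = 0.02;  /* 2% miss ratio with more cache   */

        printf("scalar, smaller cache:     CPI = %.2f\n",
               effective_cpi(1.0, small_cache_miss, penalty));
        printf("scalar, bigger cache:      CPI = %.2f\n",
               effective_cpi(1.0, big_cache_miss, penalty));

        /* Even a perfect cache only gets the scalar machine to CPI = 1.0;
           a 2-wide superscalar machine can approach CPI = 0.5. */
        printf("scalar, perfect cache:     CPI = %.2f\n",
               effective_cpi(1.0, 0.0, penalty));
        printf("2-way superscalar (ideal): CPI = %.2f\n", 0.5);
        return 0;
    }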
>ARGUMENT 4: But when you can get zillions** of transistors on a die,
>   it doesn't matter.
>   ... More transistors will help everything, however, it may be
>   that the "limiting factor" will be not die space,
>   but COMPLEXITY in critical paths and exception-handling.
>
>This last point is illustrated by the recent i486 bugs, and also, by the
>errata list carried for a long time by the i386.

Ah, yes. I think it's safe to say that our simple RISCs are going to start
to get fairly complex, as we start to play all the little hardware tricks
we've known from the supercomputer world, and some that we'll invent. Note
that a lot of errors come from exception-handling kinds of problems, and
while CISC machines have them, superscalar, out-of-order execution RISC
machines will have them in spades.

I'm beginning to believe Nick Tredennick when he says that RISCs aren't
better, just newer. In a few years, I think we may find that CISCs will be
back, to some extent. The difference is that these 'CISC's will be a bit
more carefully tuned to allow the hardware techniques now being pioneered
by RISCs to work on them. Certainly, the compiler technology will permit
them to be used efficiently; as noted in an earlier posting this week,
RISC vs. CISC compilers seem to just trade off the problems of instruction
scheduling for those of instruction selection. Both are do-able.

To give a flavor of what I mean, I'll summarize the RT/2 papers from IBM.
Note that one paper referenced the other as having the title "An IBM
post-RISC Processor Architecture", although that wasn't the title it was
given.

Superscalar: 1 fixed, 1 float, 1 branch, & 1 cond. code op simultaneously

Branch instructions:
   branch on any bit of the CC reg
   each has a bit that enables storing of PC+1 into a -->dedicated<-- link reg.
   -->no<-- delayed branching
   taken conditional branch: 0 to 3 cycles (depends on when the CC is set)
   -->dedicated<-- counter reg, for decrement-&-branch-if-0 ops, which can
      be combined with a test of any bit in the CC reg.

Cond. code ops:
   Any boolean operation on any two of the 32 bits in the CC reg.
   Useful for generating compound Boolean expressions.
   Frequently used booleans can be kept in the CC reg.

Fixed ops:
   Multiply & divide included, with a dedicated MQ reg.
   Support for min, max, & abs.
   Support for arbitrarily aligned byte string compare & move, both
      length-specified & null-terminated. A hardware dedicated byte count &
      comparison register is included in the state. String instructions are
      defined to permit the max. theoretical bus bandwidth to be used, w/
      very low overhead for short strings.
   Auto-incr & decr address modes.
   Hardware handling for load/store of misaligned data (as long as it's
      within a cache line). Optional fault if it crosses cache lines.

Floating point ops:
   Multiply & add w/ only one rounding; takes the same time as either an
      add or a multiply.
   Reg. renaming.

Overall: all interrupts/traps are precise.

Icache: 8K byte, 64 byte line, 32 entry 2-way set assoc. TLB
Dcache: 64K byte, 4-way set assoc., 128 byte line, 128 entry 2-way set
   assoc. TLB
   -->hardware<-- table walking
   Dcache has load & store buffers (store buffers so a load can be performed
      before the cache writeback, load buffers so loads can proceed during
      a fill).

Mem system has ECC & bit steering (allows a spare bit to be substituted for
   a failing bit). 4-bit DRAMs are scattered across ECC groups so a chip
   failure is detectable. A 4-deep 'pending store queue' permits address
   translation & checking even if the data is still being calculated.

Memory addressing: 52 bit virtual, 32 bit physical
   upper 4 bits of the 32 bit address select one of 16 24-bit seg. regs.
      (24+28=52)
   Seg. regs. have an I/O bit & lock enable bits.
   Lock enable turns on the hardware lock & transaction ID hardware
      (801 & RT style).
   Hardware can use the low 20 bits of the virtual address for translation
      lookup. Software must ensure that aliasing is avoided.
   (A little sketch of the address formation is appended after the .sig.)

--
baum@apple.com          (408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
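As promised above, here's roughly how I read the segment-register address
formation (my own sketch, with made-up names and types; the papers may
describe it differently): the upper 4 bits of the 32-bit effective address
pick one of the 16 segment registers, and the 24-bit segment value replaces
those 4 bits, giving 24+28 = 52 bits of virtual address.

    #include <stdio.h>

    /* Hypothetical sketch of forming a 52-bit virtual address from a 32-bit
       effective address and 16 24-bit segment registers. Names and types
       are mine, not IBM's. */
    typedef unsigned long long u64;     /* holds at least 52 bits */
    typedef unsigned long      u32;

    static u32 seg_reg[16];             /* only the low 24 bits are used */

    static u64 virtual_address(u32 ea)
    {
        u32 seg    = (ea >> 28) & 0xF;           /* upper 4 bits pick a seg. reg. */
        u64 segval = seg_reg[seg] & 0xFFFFFFUL;  /* 24-bit segment value          */
        u64 offset = ea & 0x0FFFFFFFUL;          /* remaining 28 bits of the EA   */
        return (segval << 28) | offset;          /* 24 + 28 = 52-bit virtual addr */
    }

    int main(void)
    {
        seg_reg[3] = 0xABCDEFUL;                 /* arbitrary example value */
        printf("VA = 0x%013llx\n",
               (unsigned long long) virtual_address(0x31234567UL));
        return 0;
    }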