Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!shadooby!samsung!gem.mps.ohio-state.edu!apple!baum
From: baum@Apple.COM (Allen J. Baum)
Newsgroups: comp.arch
Subject: Re: RISC vs CISC (rational discussion..) + new IBM 'beyond RISC'
Message-ID: <36340@apple.Apple.COM>
Date: 9 Nov 89 22:46:13 GMT
References: <503@ctycal.UUCP> <31031@winchester.mips.COM>
Reply-To: baum@apple.UUCP (Allen Baum)
Organization: Apple Computer, Inc.
Lines: 112

[]

>In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes:

>ARGUMENT 1: RISC is better because it's smaller, for new technologies
>When die size is a limit, RISC is better, because you can do it,

Agreed. However, current trends appear to indicate that new technologies
mature reasonably quickly. This argument works for just a few (<10?) years.

>ARGUMENT 2: RISC is better for cost reasons, because it's smaller

When you get to 1M transistors on a chip, the space for some extra decoding
logic is negligible. This argument works only during the initial
(e.g. 'new technology') phase.

>ARGUMENT 3: RISC is better, because even if there is enough space on a die
>   to put a whole CPU plus other things, the RISC can afford more space
>   for caches and other good things, and so it will be faster.

See the argument above. What is the performance difference between a 32K
cache and a 31K cache?

>ARGUMENT 4: RISC is better, because it's simpler, and hence there is faster
>   time to design and test the chips.
>   COMMENT: maybe, maybe not.

Agreed. Notice that as we get more and more transistors, we start to do more
complex things (superscalar, high-performance FP, graphics), not just add
more cache. Superscalar may double your performance. It would be literally
impossible to add enough cache to do that, so simple, regular hardware is
probably not where the transistors will go. (For disbelievers: if your cache
miss penalty * miss ratio is < 1, then even if every access hit, you'd save
less than a cycle, so CPI only approaches 1. Superscalar does better.)
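To make that arithmetic concrete, here is a back-of-the-envelope model (my
own illustrative numbers, not from anyone's paper): effective CPI is roughly
base CPI + miss ratio * miss penalty, so once miss ratio * penalty is well
under a cycle, more cache can't even buy back that cycle, while a 2-wide
superscalar machine can in principle push CPI toward 0.5.

    #include <stdio.h>

    /* Effective cycles per instruction for a simple scalar machine:
       base CPI plus the average stall contributed by cache misses.
       Illustrative model only; the numbers below are made up. */
    static double effective_cpi(double base_cpi, double miss_ratio,
                                double miss_penalty)
    {
        return base_cpi + miss_ratio * miss_penalty;
    }

    int main(void)
    {
        double penalty = 10.0;           /* cycles per miss (assumed)       */
        double small_cache_miss = 0.05;  /* 5% miss ratio (assumed)         */
        double big_cache_miss   = 0.02;  /* 2% miss ratio with more cache   */

        printf("scalar, smaller cache:     CPI = %.2f\n",
               effective_cpi(1.0, small_cache_miss, penalty));
        printf("scalar, bigger cache:      CPI = %.2f\n",
               effective_cpi(1.0, big_cache_miss, penalty));

        /* Even a perfect cache only gets the scalar machine to CPI = 1.0;
           a 2-wide superscalar machine can approach CPI = 0.5. */
        printf("scalar, perfect cache:     CPI = %.2f\n",
               effective_cpi(1.0, 0.0, penalty));
        printf("2-way superscalar (ideal): CPI = %.2f\n", 0.5);
        return 0;
    }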
>ARGUMENT 4: But when you can get zillions** of transistors on a die,
>   it doesn't matter.
>   ... More transistors will help everything, however, it may be
>   that the "limiting factor" will be not die space,
>   but COMPLEXITY in critical paths and exception-handling.
>
>This last point is illustrated by the recent i486 bugs, and also, by the
>errata list carried for a long time by the i386.

Ah, yes. I think it's safe to say that our simple RISCs are going to start
to get fairly complex, as we start to play all the little hardware tricks
we've known from the supercomputer world, and some that we'll invent. Note
that a lot of errors come from exception-handling kinds of problems, and
while CISC machines have them, superscalar, out-of-order execution RISC
machines will have them in spades.

I'm beginning to believe Nick Tredennick when he says that RISCs aren't
better, just newer. In a few years, I think we may find that CISCs will be
back, to some extent. The difference is that these 'CISC's will be a bit
more carefully tuned to allow the hardware techniques now being pioneered
by RISCs to work on them. Certainly, the compiler technology will permit
them to be used efficiently; as noted in an earlier posting this week,
RISC vs. CISC compilers seem to just trade off the problems of instruction
scheduling for those of instruction selection. Both are do-able.

To give a flavor of what I mean, I'll summarize the RT/2 papers from IBM.
Note that one paper referenced the other as having the title "An IBM
post-RISC Processor Architecture", although that wasn't the title it was
given.

Superscalar: 1 fixed, 1 float, 1 branch, & 1 cond. code op simultaneously

Branch instructions:
   branch on any bit of the CC reg
   each has a bit that enables storing of PC+1 into a -->dedicated<-- link reg.
   -->no<-- delayed branching
   taken conditional branch: 0 to 3 cycles (depends on when the CC is set)
   -->dedicated<-- counter reg, for decrement-&-branch-if-0 ops, which can
      be combined with a test of any bit in the CC reg.

Cond. code ops:
   Any boolean operation on any two of the 32 bits in the CC reg.
   Useful for generating compound Boolean expressions.
   Frequently used booleans can be kept in the CC reg.

Fixed ops:
   Multiply & divide included, with a dedicated MQ reg.
   Support for min, max, & abs.
   Support for arbitrarily aligned byte string compare & move, both
      length-specified & null-terminated. A hardware dedicated byte count &
      comparison register is included in the state. String instructions are
      defined to permit the max. theoretical bus bandwidth to be used, w/
      very low overhead for short strings.
   Auto-incr & decr address modes.
   Hardware handling for load/store of misaligned data (as long as it's
      within a cache line). Optional fault if it crosses cache lines.

Floating point ops:
   Multiply & add w/ only one rounding; takes the same time as either an
      add or a multiply.
   Reg. renaming.

Overall: all interrupts/traps are precise.

Icache: 8K byte, 64 byte line, 32 entry 2-way set assoc. TLB
Dcache: 64K byte, 4-way set assoc., 128 byte line, 128 entry 2-way set
   assoc. TLB
   -->hardware<-- table walking
   Dcache has load & store buffers (store buffers so a load can be performed
      before the cache writeback, load buffers so loads can proceed during
      a fill).

Mem system has ECC & bit steering (allows a spare bit to be substituted for
   a failing bit). 4-bit DRAMs are scattered across ECC groups so a chip
   failure is detectable. A 4-deep 'pending store queue' permits address
   translation & checking even if the data is still being calculated.

Memory addressing: 52 bit virtual, 32 bit physical
   upper 4 bits of the 32 bit address select one of 16 24-bit seg. regs.
      (24+28=52)
   Seg. regs. have an I/O bit & lock enable bits.
   Lock enable turns on the hardware lock & transaction ID hardware
      (801 & RT style).
   Hardware can use the low 20 bits of the virtual address for translation
      lookup. Software must ensure that aliasing is avoided.
   (A little sketch of the address formation is appended after the .sig.)

--
baum@apple.com          (408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
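As promised above, here's roughly how I read the segment-register address
formation (my own sketch, with made-up names and types; the papers may
describe it differently): the upper 4 bits of the 32-bit effective address
pick one of the 16 segment registers, and the 24-bit segment value replaces
those 4 bits, giving 24+28 = 52 bits of virtual address.

    #include <stdio.h>

    /* Hypothetical sketch of forming a 52-bit virtual address from a 32-bit
       effective address and 16 24-bit segment registers. Names and types
       are mine, not IBM's. */
    typedef unsigned long long u64;     /* holds at least 52 bits */
    typedef unsigned long      u32;

    static u32 seg_reg[16];             /* only the low 24 bits are used */

    static u64 virtual_address(u32 ea)
    {
        u32 seg    = (ea >> 28) & 0xF;           /* upper 4 bits pick a seg. reg. */
        u64 segval = seg_reg[seg] & 0xFFFFFFUL;  /* 24-bit segment value          */
        u64 offset = ea & 0x0FFFFFFFUL;          /* remaining 28 bits of the EA   */
        return (segval << 28) | offset;          /* 24 + 28 = 52-bit virtual addr */
    }

    int main(void)
    {
        seg_reg[3] = 0xABCDEFUL;                 /* arbitrary example value */
        printf("VA = 0x%013llx\n",
               (unsigned long long) virtual_address(0x31234567UL));
        return 0;
    }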