Xref: utzoo comp.arch:6651 alt.next:170
Path: utzoo!hoptoad!pacbell!ames!amdahl!amdcad!crackle!tim
From: tim@crackle.amd.com (Tim Olson)
Newsgroups: comp.arch,alt.next
Subject: Re: RISC v. CISC (was The NeXT problem)
Message-ID: <23290@amdcad.AMD.COM>
Date: 17 Oct 88 23:12:24 GMT
References: <156@gloom.UUCP>
Sender: news@amdcad.AMD.COM
Reply-To: tim@crackle.amd.com (Tim Olson)
Organization: Advanced Micro Devices, Inc. Sunnyvale CA
Lines: 59

In article <156@gloom.UUCP> cory@gloom.UUCP (Cory Kempf) writes:
| A while back, I was really hot on the idea of RISC.  Then a friend 
| pointed out a few things that set me straight...

I guess we are going to have to reset you straight, again! ;-)

| First, there is no good reason that all of the cache and pipeline
| enhancements cannot be put on to a CISC processor.

If it is a microcoded processor, then the CISC machine will have to
perform this pipelining at both the microinstruction and
macroinstruction levels in order to execute simple
instructions in a single cycle.  This costs more than if the
micro and macro levels were the same (RISC).

| Second, to perform a complex task, a RISC chip will need more
| instructions than a CISC chip.

This is true, although dynamic measurements typically show only 30%
more instructions, not the "3 to 5 times" that some people report.

| Third, given the same level of technology for each (ie caches, pipelines,
| etc), a microcode fetch is faster than a memory fetch.

Also true.  However, this only buys you anything if most of your
instructions take multiple cycles.  Unfortunately (?), most programs use
simple instructions which should execute in a single cycle.  If a CISC
processor is to compete effectively, it must also be able to execute the
most-used instructions in a single cycle.  Therefore, it must also have
the off-chip instruction bandwidth or on-chip cache bandwidth that RISC
requires.  With this requirement, it doesn't matter that microcode may
be slightly faster than a cache access -- the cache is the limiting
factor.
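
To put the instruction-count and cycle-count points together, here is a
rough sketch in C.  It assumes both machines run at the same clock cycle
time, so relative execution time is just relative instruction count
times relative cycles per instruction.  The function name and the
ratios in the note below are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* Execution time of machine A relative to machine B, ASSUMING equal
   clock cycle times:
       time = instruction count * average cycles per instruction
   instr_ratio = instructions(A) / instructions(B)
   cpi_ratio   = CPI(A) / CPI(B)
   A result below 1.0 means machine A finishes first. */
static double relative_time(double instr_ratio, double cpi_ratio)
{
    assert(instr_ratio > 0.0 && cpi_ratio > 0.0);
    return instr_ratio * cpi_ratio;
}
```

With the 30% figure above, relative_time(1.3, cpi_ratio) drops below 1.0
as soon as the RISC's average cycles per instruction fall under roughly
77% of the CISC's (1/1.3).  If the CISC matches the RISC cycle for cycle
on the common instructions, the ratio moves back toward 1.0, but then it
needs the same instruction bandwidth.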

| As an aside, the 68030 can do a 32 bit multiply in about (If I remember 
| correctly -- I don't have the book in front of me) 40 cycles.  A while
| back, I tried to write a 32 bit multiply macro that would take less 
| than the 40 or so that the '030 took.  I didn't even come close (even 
| assuming lots of registers and a 32 bit word size (which the 6502 
| doesn't have)).  

Most (if not all) RISCs address this by

	a) using existing floating-point multiply hardware (i.e. 32x32
	multiplier array) for integer multiply (1 - 4 cycles)

or
	b) having multiply sequencing or step operations that perform
	1-2 bits at a time (16 - 40 cycles)

so they are no slower than the current crop of CISC processors.  In
addition, if step operations are used, inexpensive "early-out"
calculations will allow the average multiply time to drop quite a bit
(because the distribution of runtime multiplies leans heavily towards
multipliers of 8 bits or less).
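
As a rough software model of option (b) with early-out, here is a
shift-and-add multiply in C that handles one multiplier bit per step and
stops as soon as no multiplier bits remain.  This is only a sketch: the
function name and the step counter are made up for illustration, it
models an unsigned multiply that keeps the low 32 bits, and it is not
the step sequence of any particular processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 1-bit-per-step unsigned multiply.
   Returns the low 32 bits of multiplicand * multiplier and stores the
   number of step operations performed in *steps. */
static uint32_t step_multiply(uint32_t multiplicand, uint32_t multiplier,
                              int *steps)
{
    uint32_t product = 0;
    int n = 0;

    assert(steps != NULL);

    /* Early-out: the loop ends once the remaining multiplier bits are
       all zero, so a small multiplier needs only a few steps. */
    while (multiplier != 0) {
        if (multiplier & 1u)
            product += multiplicand;   /* add when the current bit is set */
        multiplicand <<= 1;            /* weight of the next bit */
        multiplier >>= 1;              /* consume one multiplier bit */
        n++;
    }

    *steps = n;
    return product;
}
```

For example, step_multiply(123456, 200, &steps) returns 24691200 and
leaves steps at 8, because 200 fits in 8 bits; a multiplier with bit 31
set takes the full 32 steps.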

	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)