Xref: utzoo comp.sys.next:16567 comp.arch:22260
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!uunet!kithrup!sef
From: sef@kithrup.COM (Sean Eric Fagan)
Newsgroups: comp.sys.next,comp.arch
Subject: Re: RISC vs. CISC -- SPECmarks
Message-ID: <1991Apr26.073829.4625@kithrup.COM>
Date: 26 Apr 91 07:38:29 GMT
References: <1991Apr22.044553.16805@mp.cs.niu.edu> <1991Apr24.170804.25670@kithrup.COM> <1991Apr25.025800.4377@mp.cs.niu.edu>
Followup-To: comp.arch
Organization: Kithrup Enterprises, Ltd.
Lines: 104

(Here we go again.  *sigh*  Are people really this ignorant?)

In article <1991Apr25.025800.4377@mp.cs.niu.edu> bennett@mp.cs.niu.edu (Scott Bennett) writes:
[in response to my assertion that making a superscalar CISC machine {e.g.,
68050} is much harder than with most of the current RISC machines.]
>     If you disallow pipelining in the CISC machine, then it is most
>likely to be impossible to have so-called superscalar operation.  

Who said I was disallowing pipelining?  Pipeline your bloody CISC to death
for all I care.  For the most part, it won't make that much difference:
most CISC chips, such as the 68k and iAPX*86 series, tend to do too many
memory references in each instruction to make superscalar feasible.  Or
don't you realize you can only access one memory location at a time?  (Well,
not completely true, but true enough.)

>However,
>most CISC machines now are not only pipelined, they are *multiply* pipe-
>lined.  

Oooh.  They have more than one stage of pipelining.  Like all of the current
RISC chips, which had them before the CISC chips.

>Since a superscalar RISC can only be that way by pipelining,
>let's at least compare only pipelined architectures.  FWIW, the MC68040
>supposedly averages about 1.3 clock cycles per instruction because of
>the pipelining used.  That obviously doesn't reach "superscalar", but
>it isn't terribly far off, either.

Bullshit.  It is *very* far off.  Note the word "supposedly" in your
statement.  Please look at John Mashey's figures; I think they indicate
slightly higher (1.4 or 1.5 CPI?) for the '40; on the other hand, the R3000
got what, 1.2 or 1.3?

>     In any case, what really matters is how much work gets done per
>clock cycle, not how many instructions get done per cycle.  

No, it doesn't.  What matters is *how quickly you can get your job done*.  I
don't care if you can do a POLY instruction in 3 cycles; if you still take 2
cycles to do an add, most current RISC chips will blow you away (unless your
application consists of POLY instructions).
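
A rough way to put that: run time is instruction count times cycles per
instruction, divided by the clock rate.  The sketch below only encodes that
product.  The function name and every number in the usage lines are invented
for illustration; they are not measurements of any real chip.

```python
def run_time_seconds(instructions, cpi, clock_hz):
    """Run time = instructions * cycles per instruction / clock rate."""
    return instructions * cpi / clock_hz

# Hypothetical machines, numbers made up purely for illustration:
# machine A executes fewer instructions but needs more cycles for each one.
a = run_time_seconds(instructions=1_000_000, cpi=2.0, clock_hz=25_000_000)
b = run_time_seconds(instructions=1_300_000, cpi=1.25, clock_hz=25_000_000)
print(a, b)
```

With these made-up inputs the machine that executes more instructions still
finishes first, which is the whole point: count time, not "work per cycle".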

|One example
|is the case of moving blocks of data from one memory location to another.
|A typical RISC must 1) initialize a loop (one or more instruction fetch/
|decodes) and in the body of the loop must 2) load a word into a register
|(one fetch/decode), 3) store from the register into the new location (one
|fetch/decode), 4) increment both addresses (probably two fetches/decodes),
|5) loop back to repeat until finished (at least one fetch/decode).  Some
|CISCs have something like a "repeat" instruction that will execute 
|another instruction (e.g. a storage-to-storage move) a given number of
|times while incrementing addresses in that instruction, so the whole
|operation may require as few as two fetches/decodes.  Other CISCs have
|single instructions capable of doing block moves, so they only need one
|fetch decode.  That means more of the cycles required get spent doing the
|actual work that needs to be done than would be the case with a RISC.  A
|CISC operating in such a way would be at the *opposite* end of the spectrum
|from "superscalar", but would get its work done more quickly anyway.

This was so precious I decided to keep all of it intact.

Note how, because "some" CISCs have a "repeat" instruction, which doesn't
necessarily buy you anything (talk to Henry Spencer), all CISCs are better.

Never mind the fact that for most RISCs and CISCs the code is almost
identical, with only optimizations for the specific processor.

Listen *very* carefully:  as of right now, the most popular chips that have a
repeat instruction are the '386 and '486.  On the '486, a MOVS instruction with
no prefix takes 7 clock cycles.  A "REP MOVSB" takes 5 *if you are moving 0
bytes*, 13 *if you are moving 1 byte*, and 12+3*(number of bytes) otherwise.  The
overhead is essentially the same for setting up the rep instruction as it is
otherwise (unless you have other uses for the registers MOVS and REP want,
in which case you have to spill them, and reload, which is going to add even
more time).
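
To make the arithmetic concrete, here is a small sketch that tabulates those
'486 cycle counts.  The function names are mine, and the plain-MOVS figure
assumes n back-to-back MOVS instructions with no loop overhead at all, which
flatters the non-REP case.

```python
def rep_movsb_cycles(n):
    """Cycles for one REP MOVSB moving n bytes, using the figures quoted above."""
    if n == 0:
        return 5
    if n == 1:
        return 13
    return 12 + 3 * n

def plain_movs_cycles(n):
    """Cycles for n separate MOVS instructions at 7 cycles each.
    Assumption: fully unrolled, so no loop or branch overhead is counted."""
    return 7 * n

for n in (0, 1, 2, 3, 4, 16):
    print(n, rep_movsb_cycles(n), plain_movs_cycles(n))
```

On these numbers REP only pulls ahead once you move more than three bytes,
and it never gets below 3 cycles per byte no matter how long the block is.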

The '40 doesn't have a repeat instruction; its block-memory move loop looks
very much like the RISC version, except that the RISC versions can generally
take advantage of overlapping memory loads/stores.  I.e.

	lb	$temp, ($base + $inc)
	addu	$inc, 1, $inc
	sb	$temp, ($src + $inc)

can take a total of three cycles (well, four:  the sb needs to complete).  A
68k is likely to use something like

	mov.b	[a0+d1], [a1+d1]
	add.l	d1, $1

or somesuch (sorry, I'm not completely up to date on my 68k assembly; it's
been a while).  Note that the mov instruction has two memory references;
this is BAD.  (Even if I'm wrong, and there's only one memory reference per
instruction, making it look like the RISC version, the '40 still doesn't
have overlapping loads, I believe.)
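
A crude sketch of the memory-reference point, counting references per
instruction in the two loop bodies above.  The tables and function names are
illustrative only, and the 68k counts simply follow the admittedly
approximate sequence given here, not a checked instruction listing.

```python
# Each entry is (mnemonic, memory references made by that instruction).
# Assumption: counts are read straight off the two sequences in the text.
RISC_LOOP = [("lb", 1), ("addu", 0), ("sb", 1)]
CISC_LOOP = [("mov.b", 2), ("add.l", 0)]

def worst_case_mem_refs(loop):
    """Largest number of memory references any single instruction makes."""
    return max(refs for _, refs in loop)

def total_mem_refs(loop):
    """Memory references per byte copied (one pass through the loop body)."""
    return sum(refs for _, refs in loop)

for name, loop in (("RISC", RISC_LOOP), ("68k", CISC_LOOP)):
    print(name, total_mem_refs(loop), worst_case_mem_refs(loop))
```

Both loops touch memory twice per byte.  The difference is that no single
RISC instruction asks for more than one reference, which is what makes
overlapping them practical.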

Go *learn* before you start stating that people who know a lot more than you
(not me; I mostly just nod my head and agree with people like Mashey and
Patterson) are complete fools for not doing things properly.

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.