Path: utzoo!attcan!utgpu!bnr-vpa!bnr-fos!bnr-public!schow
From: schow@bnr-public.uucp (Stanley Chow)
Newsgroups: comp.arch
Subject: Re: Bandwidth and RISC vs. CISC
Message-ID: <423@bnr-fos.UUCP>
Date: 20 Apr 89 06:06:36 GMT
References: <38853@bbn.COM>
Sender: news@bnr-fos.UUCP
Reply-To: schow@bnr-public.UUCP (Stanley Chow)
Organization: Bell-Northern Research, Ottawa, Canada
Lines: 71

In article <38853@bbn.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>
>  I'm not sure memory bandwidth has anything to do with RISC vs. CISC.
>Remember that there are (at least) two kinds of bandwidth: instructions,
>and data.  I guess I'll concede that RISC forces instruction bandwidth
>up, or requires somewhat larger instruction caches.  However data
>bandwidth is a much more severe limitation on certain programs.  I have
>in mind numerical or scientific codes, which spend most of their time in
>small loops (instructions all in cache) sweeping through large arrays
>(which may well not fit in cache).  The average scientific code appears
>to do roughly 1.5 memory references per floating-point operation.  Can
>your 10-Megaflop (64-bit precision) micro-processor move 120 Megabytes
>per second?  RISC vs. CISC seems largely irrelevant in this domain.
>
>	-- Richard
>	schooler@bbn.com
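
For reference, the 120 Megabytes per second in the quoted article follows
from the figures it gives: 10 Megaflops, 1.5 memory references per
floating-point operation, and 8 bytes per 64-bit operand. A minimal sketch
of that arithmetic; the function and parameter names are made up for
illustration and do not come from either article:

```python
def required_bandwidth(flops_per_sec, refs_per_flop, bytes_per_ref):
    """Bytes per second the memory system must deliver to sustain the FLOP rate."""
    return flops_per_sec * refs_per_flop * bytes_per_ref

# Figures from the quoted article: 10 Megaflops, 1.5 references per FLOP,
# 64-bit (8-byte) operands.
megabytes_per_sec = required_bandwidth(10e6, 1.5, 8) / 1e6
print(megabytes_per_sec)  # 120.0
```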

You are in effect saying that CPU architecture is not related to the
bandwidth requirement. I would like to point out some ways in which they
do interact.

First of all, there are non-numeric programs; in fact, I would guess that
number crunching is no longer the major user of computing power. Some
programs have very poor hit rates in any cache. But, IMO, even in the
number-crunching area, RISC is still sub-optimal.

Second of all, I agree with you that data is a much harder problem. It is
here that I have the most trouble with RISC. It appears to me that to solve
the data bandwidth problem, one must give more information to the CPU. In
particular, a well designed architecture should work to minimize the impact
of data latency.  The basic premise of RISC is to not tell the CPU anything
until the last moment. This strikes me as a funny way of optimizing throughput.

To execute your 1.0 FLOP, the typical RISC will do about 1.5 memory access
instructions, 1.5 address adjusting instructions, say 0.5 instructions for
boundary condition checking and 0.5 jump instructions. Counting the
floating-point instruction itself, this adds up to 5 instructions to do
1 FLOP. Many CISC machines can do a FLOP with only 2 instructions.
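
The tally above can be written out as a small calculation. This is only a
sketch of the estimates in this post, not a measurement of any machine.
The four listed categories sum to 4, so the fifth instruction is assumed
here to be the floating-point operation itself; the dictionary keys and
function names are made up for illustration.

```python
# Estimated RISC instructions per floating-point operation (figures from the post).
RISC_MIX = {
    "memory access": 1.5,
    "address adjust": 1.5,
    "boundary check": 0.5,
    "jump": 0.5,
    "floating-point op": 1.0,  # assumption: the FLOP itself is the fifth instruction
}
CISC_PER_FLOP = 2.0  # "many CISC machines", per the post

def instructions_per_flop(mix):
    """Total instructions issued for each floating-point operation."""
    return sum(mix.values())

def instruction_rate(mflops, per_flop):
    """Millions of instructions per second needed to sustain `mflops`."""
    return mflops * per_flop

risc_per_flop = instructions_per_flop(RISC_MIX)
print(risc_per_flop)                       # 5.0
print(instruction_rate(10, risc_per_flop)) # 50.0
print(instruction_rate(10, CISC_PER_FLOP)) # 20.0
```

At the 10 Megaflops from the quoted article, these estimates mean the RISC
must issue 50 million instructions per second against 20 million for the
CISC. Whether the cheaper instructions make up the difference is exactly
the point in dispute.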

I can hear it now, everyone is jumping up and down saying, "what a fool,
doesn't he know that all those cycles are free?", "Hasn't he heard of 
pipelining and register scoreboarding?", "but the CISC instructions are
slower so the RISC will still run faster."

In response, I can only say, work through some real examples and see
how many cycles are wasted. Alternatively, see how many stages of
pipelining are needed to have no wasted cycles. A suitable CISC will find
out earlier that it will be doing another memory reference and can prepare
accordingly. It is even possible to have scatter/gather type hardware to
offload the CPU while maximizing data throughput.
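
One way to work through such an example is with a toy model. The sketch
below assumes a single-issue, in-order pipeline with a register
scoreboard, a 3-cycle load latency and 1-cycle latency for everything
else. All of these figures, the register names and the two instruction
sequences are hypothetical and describe no particular machine. It counts
the cycles in which nothing issues because an operand is not yet
available.

```python
LOAD_LATENCY = 3  # assumed cycles from issuing a load until its result is usable
ALU_LATENCY = 1   # assumed latency of every other instruction

def run(program):
    """Return (issue_cycles, stall_cycles) for a list of (op, dest, sources)."""
    ready = {}   # register -> first cycle at which its value can be read
    cycle = 0    # earliest cycle at which the next instruction could issue
    stalls = 0
    for op, dest, sources in program:
        # In-order issue: wait until every source register is ready.
        issue = max([cycle] + [ready.get(reg, 0) for reg in sources])
        stalls += issue - cycle
        latency = LOAD_LATENCY if op == "load" else ALU_LATENCY
        ready[dest] = issue + latency
        cycle = issue + 1
    return cycle, stalls

# One loop iteration: load two operands, add them, bump both pointers.
naive = [
    ("load", "f1", ["r1"]),
    ("load", "f2", ["r2"]),
    ("fadd", "f3", ["f1", "f2"]),  # waits on both loads
    ("add",  "r1", ["r1"]),
    ("add",  "r2", ["r2"]),
]
# Same instructions, with the pointer updates moved under the load latency.
scheduled = [
    ("load", "f1", ["r1"]),
    ("load", "f2", ["r2"]),
    ("add",  "r1", ["r1"]),
    ("add",  "r2", ["r2"]),
    ("fadd", "f3", ["f1", "f2"]),
]

print(run(naive))      # (7, 2)
print(run(scheduled))  # (5, 0)
```

In this model the wasted cycles vanish only because there happen to be
enough independent instructions to put behind the loads. Raise
LOAD_LATENCY and the stalls come back unless more independent work can be
found.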

"Compilers can do optimizations", I hear the yelling. This is another
interesting phenomenon - reduce the complexity in the CPU so that the 
compiler must do all these other optimizations. I have also not seen any
indications that a compiler can do anywhere close to an optimal job on
scheduling code or pipelining. Even discounting the NP-completeness of
just about everything, theoretical indications point the other way,
especially when the compiler has to juggle so many conflicting constraints.

It would be interesting to speculate on total system complexity: is it
higher for CISC or for RISC (with its attendant memory and compiler
requirements)?


Stanley Chow  ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public


As soon as flames start to show up, I will probably disown these
opinions to save my skin, at which point, these opinions will no 
longer represent anyone at all. Anyone wishing to be represented 
by these opinions need only say so.