Path: utzoo!attcan!utgpu!bnr-vpa!bnr-fos!bnr-public!schow
From: schow@bnr-public.uucp (Stanley Chow)
Newsgroups: comp.arch
Subject: Re: Bandwidth and RISC vs. CISC
Message-ID: <423@bnr-fos.UUCP>
Date: 20 Apr 89 06:06:36 GMT
References: <38853@bbn.COM>
Sender: news@bnr-fos.UUCP
Reply-To: schow@bnr-public.UUCP (Stanley Chow)
Organization: Bell-Northern Research, Ottawa, Canada
Lines: 71

In article <38853@bbn.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>
>  I'm not sure memory bandwidth has anything to do with RISC vs. CISC.
>Remember that there are (at least) two kinds of bandwidth: instructions,
>and data.  I guess I'll concede that RISC forces instruction bandwidth
>up, or requires somewhat larger instruction caches.  However data
>bandwidth is a much more severe limitation on certain programs.  I have
>in mind numerical or scientific codes, which spend most of their time in
>small loops (instructions all in cache) sweeping through large arrays
>(which may well not fit in cache).  The average scientific code appears
>to do roughly 1.5 memory references per floating-point operation.  Can
>your 10-Megaflop (64-bit precision) micro-processor move 120 Megabytes
>per second?  RISC vs. CISC seems largely irrelevant in this domain.
>
>	-- Richard
>	schooler@bbn.com

You are in effect saying that the CPU architecture is not related to
the bandwidth requirement. I would like to point out some ways in
which they do interact.

First of all, there are non-numeric programs; in fact, I would guess
that number crunching is no longer the major user of computing power.
Some programs have very poor hit rates in any cache. But, IMO, even in
the number-crunching area, RISC is still sub-optimal.

Second of all, I agree with you that data is a much harder problem. It
is here that I have the most trouble with RISC. It appears to me that
to solve the data bandwidth problem, one must give more information to
the CPU.
In particular, a well-designed architecture should work to minimize
the impact of data latency. The basic premise of RISC is to not tell
the CPU anything until the last moment. This strikes me as a funny way
of optimizing throughput.

To execute your 1.0 FLOP, the typical RISC will do about 1.5 memory
access instructions, 1.5 address adjusting instructions, say 0.5
instructions for boundary condition checking and 0.5 jump
instructions. Together with the floating-point instruction itself,
this adds up to 5 instructions to do 1 FLOP. Many CISC machines can do
a FLOP with only 2 instructions.

I can hear it now, everyone is jumping up and down saying, "what a
fool, doesn't he know that all those cycles are free?", "Hasn't he
heard of pipelining and register scoreboarding?", "but the CISC
instructions are slower so the RISC will still run faster."

In response, I can only say, work through some real examples and see
how many cycles are wasted. Alternatively, see how many stages of
pipelining are needed to have no wasted cycles. A suitable CISC will
find out earlier that it will be doing another memory reference and
can prepare accordingly. It is even possible to have scatter/gather
type hardware to offload the CPU while maximizing data throughput.

"Compilers can do optimizations", I hear the yelling. This is another
interesting phenomenon: reduce the complexity in the CPU so that the
compiler must do all these other optimizations. I have also not seen
any indication that a compiler can do anywhere close to an optimal job
of scheduling code or pipelining. Even discounting the NP-completeness
of just about everything, theoretical indications point the other way,
especially when the compiler has to juggle so many conflicting
constraints.

It would be interesting to speculate on total system complexity: is it
higher for CISC or for RISC (with its attendant memory and compiler
requirements)?
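
For anyone who wants to check the arithmetic rather than take my word
for it, here is a minimal sketch in Python. It only re-adds the rough
per-FLOP figures given above and the 1.5 references per FLOP from the
quoted article. The names (RISC_MIX, instructions_per_flop and so on)
are made up for illustration, counting the floating-point instruction
itself as 1.0 is my assumption about how the total of 5 is reached,
and the issue-rate figure is simply derived from the other numbers,
not measured on any machine.

```python
# Rough per-FLOP instruction mix for a load/store (RISC-style) inner loop.
# The figures are the estimates from the post, not measurements.
RISC_MIX = {
    "floating-point op": 1.0,   # assumption: the FLOP itself counts as one instruction
    "memory access": 1.5,
    "address adjustment": 1.5,
    "boundary check": 0.5,
    "jump": 0.5,
}

# "Many CISC machines can do a FLOP with only 2 instructions."
CISC_INSTRUCTIONS_PER_FLOP = 2.0


def instructions_per_flop(mix):
    """Total instructions issued per floating-point operation."""
    return sum(mix.values())


def data_bandwidth_mb_per_s(mflops, refs_per_flop=1.5, bytes_per_ref=8):
    """Data traffic in megabytes per second for a given MFLOPS rate.

    Defaults follow the quoted article: 1.5 memory references per FLOP
    and 64-bit (8-byte) operands.
    """
    return mflops * refs_per_flop * bytes_per_ref


def issue_rate_mips(mflops, instr_per_flop):
    """Instruction issue rate (millions per second) needed to sustain mflops."""
    return mflops * instr_per_flop


if __name__ == "__main__":
    risc = instructions_per_flop(RISC_MIX)
    print("RISC instructions per FLOP:", risc)
    print("CISC instructions per FLOP:", CISC_INSTRUCTIONS_PER_FLOP)
    print("Data bandwidth at 10 MFLOPS (MB/s):", data_bandwidth_mb_per_s(10))
    print("RISC issue rate at 10 MFLOPS (MIPS):", issue_rate_mips(10, risc))
    print("CISC issue rate at 10 MFLOPS (MIPS):",
          issue_rate_mips(10, CISC_INSTRUCTIONS_PER_FLOP))
```

The data bandwidth comes out the same for both styles, which is the
quoted article's point; what differs in this toy model is how many
instructions must be issued to keep that data moving.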

Stanley Chow    ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public

As soon as flames start to show up, I will probably disown these
opinions to save my skin, at which point, these opinions will no
longer represent anyone at all. Anyone wishing to be represented by
these opinions need only say so.