Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!mips!winchester!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Why floating point hardware: micro-parallelism, micro-cycles
Message-ID: <41518@mips.mips.COM>
Date: 14 Sep 90 23:41:34 GMT
References: <197@validgh.com>
Sender: news@mips.COM
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Inc.
Lines: 219

In article Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) writes:
>>>>>> On 9 Sep 90 15:17:44 GMT, dgh@validgh.com (David G. Hough on validgh) said:
>David> ...since floating-point instructions can be decomposed into simple
>David> integer operations, how can they be justified in a RISC
>David> architecture? Why is it that they don't run as fast in software?
>David> (They don't, and can't, but you might have to try it to convince
>David> yourself. All you need to do is look at 64-bit double precision
>David> floating-point add/subtract on a 32-bit RISC architecture).
>
>David> Basically I was attacking the idea that RISC = 'a few simple
>David> instructions'. This was an overly simple definition anyway. The
>David> correct definition of RISC architecture is 'good engineering' in the
>David> sense of 'good engineering economy', although not everybody has
>David> realized this yet.
>
>Perhaps RISC does indeed stand for Reduced Instruction Set, and "good
>engineering" can, and has, been applied to CISC architectures (notably the
>80486 and the 68040).
>
>Modern processor design is indeed indebted to the RISC pioneers who, in
>order to compensate for reduced instruction sets, applied "good
>engineering" to come up with some remarkable techniques for parallelism.
>_Except for the reduced number of instructions_, these same techniques can
>be applied to CISC (albeit some techniques with more difficulty).
>
>If a CISC processor _averages_ close to 1 Cycle Per Instruction, what is
>the advantage of removing many of those instructions? Are you claiming a
>CISC processor is somehow transformed into a RISC processor because of an
>improved CPI, _even though the actual instruction set has not diminished_?
>(e.g. the 68040 & 80486)
>
>In a given technology, the physics of the medium limits how fast a switch
>can toggle, leaving parallelism as the route for even greater throughput.
>It appears Reduced Instruction Sets and parallelism are, to a great degree,
>orthogonal. Am I missing something here?
>
>Is it possible higher silicon densities will shift (or have shifted) the
>economics of processor design toward more robust parallelized instruction
>sets, perhaps even toward "Super CISC"?
>
> Just for discussion,
>
>David> David Hough
>David> dgh@validgh.com uunet!validgh!dgh na.hough@na-net.stanford.edu
>
>#include
>--
>Chuck Phillips MS440
>NCR Microelectronics Chuck.Phillips%FtCollins.NCR.com
>2001 Danfield Ct.
>Ft. Collins, CO. 80525 uunet!ncrlnk!ncr-mpd!bach!chuckp

There are a bunch of things in the following discussion that could use
some clarification or amplification, so here goes:

In article Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) writes:
>>>>>> On 9 Sep 90 15:17:44 GMT, dgh@validgh.com (David G. Hough on validgh) said:
>David> ...since floating-point instructions can be decomposed into simple
>David> integer operations, how can they be justified in a RISC
>David> architecture? Why is it that they don't run as fast in software?
>David> (They don't, and can't, but you might have to try it to convince
>David> yourself.
>David> All you need to do is look at 64-bit double precision
>David> floating-point add/subtract on a 32-bit RISC architecture).
>David> Basically I was attacking the idea that RISC = 'a few simple
>David> instructions'. This was an overly simple definition anyway. The
>David> correct definition of RISC architecture is 'good engineering' in the
>David> sense of 'good engineering economy', although not everybody has
>David> realized this yet.

Dgh has this right about FP (note that on a MIPS, a 64-bit FP add takes
2 cycles, which is hard to match with sequences of integer instructions),
and it is a good example of what people really do, without the confusion
of counting instructions.

>Perhaps RISC does indeed stand for Reduced Instruction Set, and "good
>engineering" can, and has, been applied to CISC architectures (notably the
>80486 and the 68040).

Good engineering can of course be applied to CISCs, and has been, for years.
If you track succeeding designs among, for example, the S/360 & VAX
families, you will find that the designers have carefully studied the
statistics of program behavior and moved some instructions from microcode
into hardware, or vice versa, or even into software emulation. Examples
include:
    360/44 (didn't have decimal ops, for example)
    MicroVAX II (also didn't have decimal ops)
In addition, successive designs have generally gotten more efficient
pipeline designs and memory hierarchies. Certainly, the 80486 is a fine
implementation, and the 68040 appears to be well thought out, from the
published information. This whole process goes on among all competent
computer designers, has gone on for many years, and is not particularly
new; nor would I expect any knowledgeable RISC designer to tell you that
it was something magic and new.
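Returning to dgh's point about 64-bit FP add on a 32-bit machine, here is a minimal sketch of a double-precision add built only from integer operations. Python stands in for the integer instruction sequence a 32-bit RISC would execute. The sketch makes several assumptions. Both operands are positive, normal, finite doubles. The sum does not overflow. Only round-to-nearest-even is handled. Signs and subtraction, zeros, subnormals, infinities, and NaNs are ignored. All function names are hypothetical. The helper `add64_via_32` shows that even the single significand addition becomes two 32-bit adds plus a carry on a 32-bit datapath.

```python
import struct

# Illustrative sketch only (hypothetical names): positive, normal, finite
# inputs, no overflow, round-to-nearest-even. Not a complete soft-float library.
MASK32 = 0xFFFFFFFF
FRAC_MASK = (1 << 52) - 1  # low 52 bits of an IEEE 754 double
HIDDEN = 1 << 52           # implicit leading 1 of a normal double


def to_bits(x: float) -> int:
    """Raw IEEE 754 bit pattern of a double, as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(b: int) -> float:
    """Inverse of to_bits."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def add64_via_32(x: int, y: int) -> int:
    """Add two non-negative integers the way a 32-bit CPU would:
    low words first, then high words plus the carry out of the low add."""
    lo = (x & MASK32) + (y & MASK32)
    carry = lo >> 32
    hi = (x >> 32) + (y >> 32) + carry
    return (hi << 32) | (lo & MASK32)


def soft_add_positive(a: float, b: float) -> float:
    """Double-precision add using only integer operations (see assumptions)."""
    ba, bb = to_bits(a), to_bits(b)
    # 1. Unpack: 11-bit biased exponent (sign bit is 0 by assumption)
    #    and 52-bit fraction with the hidden bit restored.
    ea, eb = ba >> 52, bb >> 52
    ma, mb = (ba & FRAC_MASK) | HIDDEN, (bb & FRAC_MASK) | HIDDEN
    # 2. Swap so that operand a has the larger exponent.
    if eb > ea:
        ea, eb, ma, mb = eb, ea, mb, ma
    # 3. Align: append guard/round/sticky bits, shift the smaller operand
    #    right, and OR every bit shifted out into the sticky bit.
    ma, mb = ma << 3, mb << 3
    shift = ea - eb
    if shift >= 56:
        mb = 1  # everything shifted out; only the sticky bit survives
    elif shift > 0:
        sticky = 1 if mb & ((1 << shift) - 1) else 0
        mb = (mb >> shift) | sticky
    # 4. Add significands (two 32-bit adds plus a carry on a 32-bit machine).
    m, e = add64_via_32(ma, mb), ea
    # 5. Normalize: the sum of two values in [1, 2) is below 4,
    #    so at most one right shift is needed (keep the lost bit as sticky).
    if m >> 56:
        m = (m >> 1) | (m & 1)
        e += 1
    # 6. Round to nearest, ties to even, using the three extra bits.
    low, m = m & 0b111, m >> 3
    if low > 0b100 or (low == 0b100 and m & 1):
        m += 1
        if m >> 53:  # rounding carried into a new leading bit
            m >>= 1
            e += 1
    # 7. Repack: drop the hidden bit and insert the exponent.
    return from_bits((e << 52) | (m & FRAC_MASK))
```

Even this stripped-down case needs an unpack, a compare and swap, a variable shift with sticky-bit collection, a multi-word add, normalization, rounding, and a repack. Each of these steps takes several instructions on a 32-bit integer datapath. The omitted cases add further tests and branches.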
So what's the difference? Let's try again. RISC micros were designed from
the beginning:

1) To avoid instruction complexity that would require microcode in general,
   which often costs you 1.5-2 : 1 if used for the simpler instructions.

2) (In better cases) with a great deal of input from software people.
   Since RISCs are newer, they have a lot of benefit from hindsight.
   Since RISCs were designed when there was considerably more use of
   high-level languages and (sometimes) optimizing compilers, it was much
   easier to study these things and feed them into the design. As it
   happens, compiler technology has taken leaps in the last decade, and the
   tradeoffs have changed. This is not surprising, since the entire nature
   and structure of the computer business is a lot different from 10 years
   ago, and unbelievably different from 20 years ago.

3) RISCs usually were designed after it was clear that caches were good
   things. That let them make tradeoffs from Day 1 that were not
   necessarily appropriate for architectures designed when caches were
   either unknown or not practical for the part of the design space being
   attacked. Also in this category are:
       a) Pure code segments
       b) Virtual memory support, if needed
   Some older machines allowed programs to write into their code any time
   they felt like it (for example, into the immediately succeeding
   instruction), or included features that conflicted more with VM than
   they needed to. All of these can be worked around, but hindsight...

4) RISCs are generally designed to permit clean, simple pipelining, without
   requiring huge amounts of logic for special cases and such. This is
   certainly one of the key differences, and again, some of it comes from
   hindsight.

5) Avoid those instructions that can easily be simulated by sequences of
   simple ones AT COMPARABLE PERFORMANCE.
   Include those instructions, NO MATTER HOW "COMPLEX" someone thinks they
   are, if those instructions achieve performance that cannot be
   approximated otherwise, and if the tradeoffs are acceptable. (Again:
   include FP Add, which may well be a huge hunk of hardware, but don't
   include Translate&Test.)

It is interesting, as H&P point out, that never in the history of
computing have a bunch of ISA designs (note: just ISAs, nothing said about
architecture in general) done at the same time resembled each other as
much as the current crop of RISCs do. (This is where they describe several
different chips by showing their relatively minor differences from their
DLX.) This doesn't mean there aren't important differences among them.
Still, machines that have 32-bit instructions, load/store orientation,
usually 32 integer registers available at once, etc., are a lot more alike
than, say:
    IBM 1401, IBM 7074, and IBM 7094, or
    S/360, CDC 6600, and Univac 1108, or
    VAX & DG MV, or
    Intel 8086, Moto 68000, and NSC 32K.

>Modern processor design is indeed indebted to the RISC pioneers who, in
>order to compensate for reduced instruction sets, applied "good
>engineering" to come up with some remarkable techniques for parallelism.
>_Except for the reduced number of instructions_, these same techniques can
>be applied to CISC (albeit some techniques with more difficulty).

As noted, good engineering practice is good engineering practice, and it
didn't start with RISC. However, the reduced number of instructions is the
LEAST of the issues, and people keep getting confused by this. Much more
relevant are issues like:
    Operand and instruction alignment, especially in VM systems
    Number and especially kinds of addressing modes (multi-level
    indirect, for example)
    Number & size of operand fetches/writes caused by an instruction
    Multiple instruction sizes
    Number and kind of side-effects caused by an instruction, especially
    in VM systems
    Exception model

>If a CISC processor _averages_ close to 1 Cycle Per Instruction, what is
>the advantage of removing many of those instructions? Are you claiming a
>CISC processor is somehow transformed into a RISC processor because of an
>improved CPI, _even though the actual instruction set has not diminished_?
>(e.g. the 68040 & 80486)

Well, so far, 80486s don't appear to average close to 1 CPI, although,
as I've pointed out before, only the designers really know. On the other
hand, you can approximate CPI by MHz/(Integer-VAX-mips) for machines where
that makes sense, using SPEC integer = Integer-VAX-mips. You then get
numbers like these (from "Your Mileage May Vary", Issue 2.0):

Clock   SPEC-Int  Clock/SPEC  Chip     System
(MHz)
 25      11.2      2.23       SPARC    Sun SS1+ w/s (64K cache)
 25      12.3      2.03       SPARC    Sun SS330 w/s (128K cache)
 25      13.3      1.88       486      Intel-reported (128K)
 25      18.3      1.37       88K      Moto 8864SP (128K)
 25      19.4      1.29       R3000    MIPS Magnum 3000 w/s (64K)
 25      19.7      1.27       R3000    MIPS M/2000, RC3260 (128K)
 25      20.2      1.24       RS/6000  IBM RS/6000 model 530 w/s (72K)

Note, of course, that there is some element of apples & oranges here.
These machines are not completely contemporaneous in design, sometimes have
rather different silicon budgets, etc. Still, suppose clock/SPEC is
anywhere near CPI for these machines (it is for MIPS, but that's the only
one I can be sure of). Then the 486 is still off by a factor of 2.
(Mainframes would get closer to 1, I think, and I suspect the '040 will do
a little better also.)

Of course, doing a heavily streamlined implementation of a VAX, X86, 68K,
etc. doesn't magically make them RISC architectures, but of course, one
shouldn't care much, either (except for marketing :-). The engineers are
doing what they should be: making them go faster.
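The Clock/SPEC column is simple arithmetic, and recomputing it makes the comparison explicit. The minimal Python sketch below uses only the numbers from the table above. As the text says, Clock/SPEC is at best a rough proxy for CPI. The data layout and the function name `clock_per_spec` are illustrative assumptions.

```python
# (clock in MHz, SPECint, chip, system), as listed in the table above
MACHINES = [
    (25, 11.2, "SPARC", "Sun SS1+ w/s (64K cache)"),
    (25, 12.3, "SPARC", "Sun SS330 w/s (128K cache)"),
    (25, 13.3, "486", "Intel-reported (128K)"),
    (25, 18.3, "88K", "Moto 8864SP (128K)"),
    (25, 19.4, "R3000", "MIPS Magnum 3000 w/s (64K)"),
    (25, 19.7, "R3000", "MIPS M/2000, RC3260 (128K)"),
    (25, 20.2, "RS/6000", "IBM RS/6000 model 530 w/s (72K)"),
]


def clock_per_spec(clock_mhz: float, spec_int: float) -> float:
    """Rough CPI proxy (hypothetical helper): MHz / integer VAX-mips, with
    SPECint standing in for integer VAX-mips. Only meaningful where that
    approximation holds."""
    return clock_mhz / spec_int


# Print each machine's ratio, rounded as in the table.
for clock, spec, chip, system in MACHINES:
    print(f"{chip:8s} {system:34s} {clock_per_spec(clock, spec):.2f}")
```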
Of course, the engineers sometimes have to squeeze harder to get everything
in. I have high respect for the implementation cleverness that has often
gone into such things, because it is VERY HARD WORK to make ANYTHING go
really fast, and people have to live with past decisions. Consider people
who build mainframes (IBM & PCMs): they must live with decisions made 25
years ago....
-- 
-john mashey    DISCLAIMER:
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086