Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!apple!bbn!bbn.com!slackey
From: slackey@bbn.com (Stan Lackey)
Newsgroups: comp.arch
Subject: Re: More RISC vs. CISC wars
Message-ID: <42688@bbn.COM>
Date: 13 Jul 89 14:57:16 GMT
References: <42621@bbn.COM> <13984@lanl.gov>
Sender: news@bbn.COM
Reply-To: slackey@BBN.COM (Stan Lackey)
Organization: Bolt Beranek and Newman Inc., Cambridge MA
Lines: 80

The discussion continues between jlg@lanl.gov (Jim Giles) and me.  If
you are bored with it, "Type 'n' now".

In article <13984@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
>> In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>>>And, how many microcycles does 'one cycle' on the Alliant correspond
>>>to?
>>
>> One.  The reason many, even memory-to-register, operations take one
>> microcycle is because it has a scalar pipeline.  Even though pipelines
>> "can't-be-done" on CISC's.

I apologize for the sarcasm.  I have seen too many "can't be done in a
CISC" or "is too hard to do in a CISC" statements, referring to things
I have done in a CISC.

>You are either using pipelines (in which case the instruction _issues_
>in one clock, but the result is not delivered for several more), or
>you aren't (in which case, I don't believe your claim that the
>instruction has no microcycles).

The basic clock of the Alliant CE is 170ns.  One new microword is
accessed every 170ns cycle.  Many instructions consume one 170ns cycle;
some consume more.  FADD.D (ay)+, fp0 consumes one.  FDIV.D , fp0
consumes more than one, like 3 or 4.

>Now that you've said that the Alliant is pipelined, you have to tell
>me what the _real_ instruction timing for the given example is.  What
>is the minimum number of clocks between issuing the given instruction
>and issuing the next instruction which uses one of the results of the
>one given?  Bet it ain't 1.

Bet it is, for lots of cases.  The CE has a fixed six-stage pipeline.
The stages are:

1. Instruction cache access and instruction decode
2. Address calculation and microcode access
3. Address translation and passing the address through the crossbar
4. Cache access and returning the data through the crossbar (on a
   read; send data on a write)
5. Integer execution, or passing of operands to the floating point unit
6. Floating point execution and writing of results

So, the full execution time of an FP instruction is 6*170ns, but a new
instruction can be started every 170ns.  Dependencies cause dead
cycles to be inserted.  These dependencies include an integer result
being used as an address in the next instruction, but do not include
integer or floating point data dependencies; we used the 50ns BIT
(Bipolar Integrated Technology) functional units, and wired the data
paths efficiently enough that dependent operands could be routed in
time.

In the implementation, only one microword is accessed for the entire
instruction.  It is a very wide microinstruction, and the fields of it
that are destined to control operations later in the instruction are
delayed by "pipeline registers".  The technique was called "data
stationary control" in the textbook we got it out of.  Lore has it
that IBM has used this style in their mainframes, and calls it
"delayed microinstructions" or something similar.

Note: Because condition codes are not available to a branch
instruction immediately following a compare, branch prediction is
employed.

Also note: the above strategy seems to work real well.  Compare its
Whets with those of the 1989 50ns RISC machines.

My opinions, and not necessarily those of Alliant Computer Systems,
International Business Machines, BBN, the publisher of the textbook we
got "data stationary" out of, or anybody else, living or dead, whom I
may have mentioned. :-)
-Stan
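[Editorial illustration] The "data stationary control" idea above can be
sketched in a few lines of Python.  This is my own toy model, not
Alliant's microcode: the stage names follow the six stages listed, but
every field name and control value is made up.  One wide microword is
fetched per instruction, and fields destined for later stages simply
ride down a chain of pipeline registers until their stage comes up:

```python
# Toy model of data stationary control: one wide microword per
# instruction, advanced one stage per cycle; each field fires when the
# microword reaches the stage that field controls.  Stage names follow
# the six-stage CE pipeline described above; field contents are invented.

STAGES = [
    "icache_decode",    # 1. instruction cache access, decode
    "addr_calc_ucode",  # 2. address calculation, microcode access
    "translate_xbar",   # 3. address translation, crossbar
    "cache_access",     # 4. cache access, data via crossbar
    "int_exec",         # 5. integer execute / pass operands to FP unit
    "fp_exec",          # 6. floating point execute, write results
]

def run(microwords):
    """Advance microwords down the pipe, one stage per cycle.
    Each microword is a dict mapping stage name -> control field.
    Returns a log of (cycle, stage, control-field) events."""
    pipe = [None] * len(STAGES)   # pipeline registers, one per stage
    log = []
    words = list(microwords)
    cycle = 0
    while words or any(w is not None for w in pipe):
        # shift the pipeline registers: each microword moves one stage
        pipe = [words.pop(0) if words else None] + pipe[:-1]
        for stage, word in zip(STAGES, pipe):
            if word is not None and stage in word:
                log.append((cycle, stage, word[stage]))
        cycle += 1
    return log

# A memory-to-register FADD needs the cache-read field in stage 4 and
# the FP-execute field in stage 6; one microword carries both, with the
# later fields delayed by the pipeline registers.
fadd = {"icache_decode": "decode FADD",
        "cache_access": "read (ay)+",
        "fp_exec": "fadd.d -> fp0"}
log = run([fadd, dict(fadd)])     # two back-to-back instructions
```

Running this on two back-to-back FADDs shows the point of the scheme:
a new microword enters the pipe every cycle, while each one still takes
the full six cycles from fetch to FP writeback.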