Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!apple!bbn!bbn.com!slackey
From: slackey@bbn.com (Stan Lackey)
Newsgroups: comp.arch
Subject: Re: More RISC vs. CISC wars
Message-ID: <42688@bbn.COM>
Date: 13 Jul 89 14:57:16 GMT
References: <42621@bbn.COM> <13984@lanl.gov>
Sender: news@bbn.COM
Reply-To: slackey@BBN.COM (Stan Lackey)
Organization: Bolt Beranek and Newman Inc., Cambridge MA
Lines: 80

The discussion continues between jlg@lanl.gov (Jim Giles) and me.  If
you are bored with it, "Type 'n' now".

In article <13984@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
>> In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>>>And, how many microcycles does 'one cycle' on the Alliant correspond
>>>to?
>>
>> One.  The reason many, even memory-to-register, operations take one
>> microcycle is because it has a scalar pipeline.  Even though pipelines
>> "can't-be-done" on CISC's.

I apologize for the sarcasm.  I have seen too many "can't be done in a
CISC" or "is too hard to do in a CISC" statements, referring to things
I have done in a CISC.

>You are either using pipelines (in which case the instruction _issues_
>in one clock, but the result is not delivered for several more), or
>you aren't (in which case, I don't believe your claim that the
>instruction has no microcycles).

The basic clock of the Alliant CE is 170ns.  One new microword is
accessed every 170ns cycle.  Many instructions consume one 170ns cycle;
some consume more.  FADD.D (ay)+, fp0 consumes one.  FDIV.D , fp0
consumes more than one, like 3 or 4.

>Now that you've said that the Alliant is pipelined, you have to tell
>me what the _real_ instruction timing for the given example is.  What
>is the minimum number of clocks between issuing the given instruction
>and issuing the next instruction which uses one of the results of the
>one given?  Bet it ain't 1.

Bet it is, for lots of cases.  The CE has a fixed six-stage pipeline.
The stages are:

1. Instruction cache access and instruction decode
2. Address calculation and microcode access
3. Address translation and passing the address through the crossbar
4. Cache access and returning the data through the crossbar (on a
   read; send data on a write)
5. Integer execution, or passing of operands to the floating point unit
6. Floating point execution and writing of results

So, the full execution time of an FP instruction is 6*170ns, but a new
instruction can be started every 170ns.  Dependencies cause dead
cycles to be inserted.  These dependencies include an integer result
being used as an address in the next instruction, but do not include
integer or floating point data dependencies; we used the 50ns BIT
(Bipolar Integrated Technology) functional units, and wired the data
paths efficiently enough that dependent operands could be routed in
time.

In the implementation, only one microword is accessed for the entire
instruction.  It is a very wide microinstruction, and the fields of it
that are destined to control operations later in the instruction are
delayed by "pipeline registers".  The technique was called "data
stationary control" in the textbook we got it out of.  Lore has it
that IBM has used this style in their mainframes, and calls it
"delayed microinstructions" or something similar.

Note: Because condition codes are not available to a branch
instruction immediately following a compare, branch prediction is
employed.

Also note: the above strategy seems to work real well.  Compare its
Whets with those of the 1989 50ns RISC machines.

My opinions, and not necessarily those of Alliant Computer Systems,
International Business Machines, BBN, the publisher of the textbook we
got "data stationary" out of, or anybody else, living or dead, whom I
may have mentioned. :-)
-Stan
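[Editorial illustration] The "data stationary control" idea above can be
sketched in a few lines of Python.  This is my own toy model, not
Alliant's microcode: the stage names follow the six stages listed, but
every field name and control value is made up.  One wide microword is
fetched per instruction, and fields destined for later stages simply
ride down a chain of pipeline registers until their stage comes up:

```python
# Toy model of data stationary control: one wide microword per
# instruction, advanced one stage per cycle; each field fires when the
# microword reaches the stage that field controls.  Stage names follow
# the six-stage CE pipeline described above; field contents are invented.

STAGES = [
    "icache_decode",    # 1. instruction cache access, decode
    "addr_calc_ucode",  # 2. address calculation, microcode access
    "translate_xbar",   # 3. address translation, crossbar
    "cache_access",     # 4. cache access, data via crossbar
    "int_exec",         # 5. integer execute / pass operands to FP unit
    "fp_exec",          # 6. floating point execute, write results
]

def run(microwords):
    """Advance microwords down the pipe, one stage per cycle.
    Each microword is a dict mapping stage name -> control field.
    Returns a log of (cycle, stage, control-field) events."""
    pipe = [None] * len(STAGES)   # pipeline registers, one per stage
    log = []
    words = list(microwords)
    cycle = 0
    while words or any(w is not None for w in pipe):
        # shift the pipeline registers: each microword moves one stage
        pipe = [words.pop(0) if words else None] + pipe[:-1]
        for stage, word in zip(STAGES, pipe):
            if word is not None and stage in word:
                log.append((cycle, stage, word[stage]))
        cycle += 1
    return log

# A memory-to-register FADD needs the cache-read field in stage 4 and
# the FP-execute field in stage 6; one microword carries both, with the
# later fields delayed by the pipeline registers.
fadd = {"icache_decode": "decode FADD",
        "cache_access": "read (ay)+",
        "fp_exec": "fadd.d -> fp0"}
log = run([fadd, dict(fadd)])     # two back-to-back instructions
```

Running this on two back-to-back FADDs shows the point of the scheme:
a new microword enters the pipe every cycle, while each one still takes
the full six cycles from fetch to FP writeback.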