Path: utzoo!attcan!uunet!husc6!cs.utexas.edu!oakhill!mpaton
From: mpaton@oakhill.UUCP (Michael Paton)
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Keywords: RISC pipeline memory power load latency
Message-ID: <1362@oakhill.UUCP>
Date: 1 Jul 88 20:57:15 GMT
Organization: Motorola Inc. Austin, Tx
Lines: 120

In article <2465@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:

>>	b) If so, what is the reason for the second latency slot?

The reason for the additional delay slot is twofold.  First, the M88100
has allotted .2 clocks to the data path, .7 clocks to a 32 bit adder,
and more than .25 of a clock for driving a signal off chip with .1
clock setup margins.  Since we desired all signals to be synchronous,
we rounded this 1.25 clocks up to 1.5 clocks.  These 1.5 clocks are
spent performing the addition and delivering the address to the M88200
cache.  The cache has 1 clock from valid address to valid data (1.2
clocks from valid address to hit).  The processor then uses the
remaining .5 clock to perform byte extraction and sign extension and to
deliver the result to the result bus.  Therefore the memory pipeline is
3 clocks in length.
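
As a quick check of the arithmetic above, here is a minimal Python
sketch that adds up the budget.  The figures are the ones quoted in the
paragraph; the dictionary keys, the function names, and the choice to
work in hundredths of a clock (to dodge floating point rounding) are
illustrative assumptions, as is treating "rounded up" as rounding to
the next half clock.

```python
# Clock budget from the paragraph above, in hundredths of a clock.
# Names are illustrative only; they do not come from any Motorola document.
ADDRESS_PHASE = {
    "data path": 20,
    "32-bit adder": 70,
    "off-chip drive": 25,   # the post says "more than .25", so this is a floor
    "setup margin": 10,
}
CACHE_ACCESS = 100          # valid address to valid data in the M88200
RESULT_PHASE = 50           # byte extraction, sign extension, result bus
HALF_CLOCK = 50             # assumed rounding granularity


def round_up_to_half_clock(hundredths):
    """Round up to the next multiple of half a clock (integer arithmetic)."""
    return -(-hundredths // HALF_CLOCK) * HALF_CLOCK


def address_phase_raw():
    """Unrounded address generation and delivery time."""
    return sum(ADDRESS_PHASE.values())


def load_pipeline_length():
    """Total load pipeline length in hundredths of a clock."""
    return round_up_to_half_clock(address_phase_raw()) + CACHE_ACCESS + RESULT_PHASE


if __name__ == "__main__":
    print("raw address phase :", address_phase_raw() / 100, "clocks")
    print("load pipeline     :", load_pipeline_length() / 100, "clocks")
```

The raw address phase comes to 1.25 clocks, and the total is the 3
clocks stated above.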

The MIPS Co. designers sliced this into 2 clocks and integrated it into
their basic (only) pipeline.  This means that when the data cache
misses or is not available (e.g. during bus snooping), their entire
processor skips a beat.  Since the M88100 data pipeline is decoupled
from the instruction pipeline, the beat is not skipped unless a data
dependency occurs.  This is more beneficial in the case of outbound
store traffic on write-through pages than on the nominal load traffic.

In two clocks they must perform a 32+16 addition and the mmu function
before driving the pins.  Moreover, the multiplexed bus requires that
the addresses be latched (externally on both the R2000 and the R3000).
Memory must still be accessed and delivered to the result bus/register
file/forwarding unit.  Therefore, the time must come from somewhere,
and it seems to have come out of the memory SRAM speed in the R3000 and
the circuit speeds and margins.  According to the data sheet (using a
F373 transparent latch), the MIPS design requires 20ns SRAMs.  Our
design (assuming a glue part cache implementation: F373 latch and 3ns
t-sub-OHA* SRAM with spec sheet timing) allows 32ns SRAM parts at a 40ns
cycle or 42ns SRAM parts at a 50ns cycle.  The MIPS design requires 0ns
SRAM at 40 MHz,
while we do not hit this brick wall until 120 MHz.
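
The two M88100 data points above (32ns SRAM at a 40ns cycle, 42ns SRAM
at a 50ns cycle) are consistent with a fixed overhead subtracted from
the cycle time.  The sketch below makes that reading explicit.  The
linear model and all names are assumptions made for illustration, not
something stated in the data sheets; under that model the zero-access-
time point lands at 125 MHz, in the same neighbourhood as the 120 MHz
figure quoted above.

```python
# (cycle time in ns, allowed SRAM access time in ns), as quoted in the post.
M88100_POINTS = [(40, 32), (50, 42)]


def fixed_overhead_ns(points):
    """Return the constant (cycle - SRAM access) overhead shared by all points.

    Assumes the simple model  sram_ns = cycle_ns - overhead_ns.
    Raises ValueError if the points do not agree on a single overhead.
    """
    overheads = {cycle - sram for cycle, sram in points}
    if len(overheads) != 1:
        raise ValueError("points do not fit a constant-overhead model")
    return overheads.pop()


def allowed_sram_ns(cycle_ns, points=M88100_POINTS):
    """SRAM access time the model allows at a given cycle time."""
    return cycle_ns - fixed_overhead_ns(points)


def brick_wall_mhz(points=M88100_POINTS):
    """Clock frequency at which the allowed SRAM access time reaches 0 ns."""
    return 1000.0 / fixed_overhead_ns(points)


if __name__ == "__main__":
    print("overhead  :", fixed_overhead_ns(M88100_POINTS), "ns")
    print("brick wall:", brick_wall_mhz(), "MHz")
```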

Since the two SRAM arrays' data pins are connected together and to
three other interfaces, they must be sampled at 2/3 of the phase period
by all interested parties and then rapidly removed from the multiplexed
data bus (the data is only valid for 3.5ns on the data bus). This
requires very strict timing control over this vital feature.  In
addition, the Cypress specs I checked had t-sub-HZOE** of 15ns maximum
while the R3000 requires this in 8ns.  Note that this is not a problem
as the other bank of SRAMs has a minimum output enable time of 9ns
while the Cypress parts (CY7C161 16K*4) did not.

The MIPS processors do not snoop their bus and therefore leave memory
coherence to the write-through mechanism.  In multiprocessing
applications, the memory bus can become saturated with a few processors
on the bus (~4?).  Write-back caches cause a sufficient reduction in
memory bus traffic to allow twice the number of processing ensembles to
utilize the bus.

Therefore, the answer to the original question is: we were more
conservative in allotting nanoseconds to functions, and, in particular,
we attempted to beat on the SRAM technology less hard.  If we are
correct, this should be more scalable in the future (read ECL/GaAs) as
off-chip delays approach .4 cycle.  Alternatively, our design costs
less to manufacture in high volume and allows less costly SRAM parts
than the MIPS Co. design.

This should allow us to push the clock frequency farther than MIPS Co.,
but not with the current mask set.  Remember the recent announcement of
33 MHz M68020 and M68030: a little shrink and a lot of tuning can
obtain many MHz.  We have just begun to tune these M88k parts.

>>Note that our numbers say that in our machines, it would cost us
>>10-15% in overall performance to go from 1 cycle latency to 2,

Our numbers show that with the decoupling of the pipeline and earliest
load scheduling we suffer 6-8% cycle-to-cycle performance degradation
when comparing the MIPS design to the M88100 design.   This is not too
bad when one considers the relaxation of external RAM speed requirements
and allowing the hit logic to run in arrears of the data delivery.

Independent of this current discussion on the length of load pipelines,
one might want to ask the folks at MIPS Co. this question:

	      Why did you multiplex your memory bus?

Consider factors related to power dissipation.  The current M88100
processors are running between .25 and .5 watts @ 20 MHz.  If we were to
multiplex the 2 memory ports as did MIPS Co., our worst case power
consumption would be 4 watts.  The problem is that the AC power
dissipation is given by:

                            2*C*V**2*F*N,          
where:
                    C = load capacitance (F),
                    V = voltage swing (V),
                    F = frequency (Hz), and
                    N = number of pins which make transitions.

      (For the MC88100, V = 3.8 volts (TTL logic levels) and F = 20 MHz)
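
To get a feel for the formula above, here is a small Python sketch that
evaluates it.  V and F are the values quoted for the MC88100; the load
capacitance of 50 pF per pin and the example pin counts are purely
hypothetical placeholders, since the post gives neither.

```python
V_SWING = 3.8          # volts, from the post
F_CLOCK = 20e6         # Hz, from the post
C_LOAD_ASSUMED = 50e-12  # farads per pin: HYPOTHETICAL, not from the post


def ac_power_watts(c_farads, v_swing, f_hz, n_pins):
    """AC power per the formula in the post: 2 * C * V**2 * F * N."""
    return 2.0 * c_farads * v_swing ** 2 * f_hz * n_pins


if __name__ == "__main__":
    # N is the number of pins that actually toggle, so power scales
    # linearly with how uncorrelated the address stream is.
    for n in (1.4, 8, 16, 32):   # example values only
        watts = ac_power_watts(C_LOAD_ASSUMED, V_SWING, F_CLOCK, n)
        print(f"N = {n:>4}: {watts:.3f} W")
```

The point of the exercise is only the linear dependence on N: every
extra pin that toggles per cycle costs the same increment of power.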

The addresses from the instruction port are very highly correlated
(about 1.4 address bits change per cycle).  The addresses from the data port
are only partially correlated (less so with better compilers).  Mixing
these two streams results in almost uncorrelated address streams and
therefore a bigger N, resulting in more power dissipation.  Notice that
the pin counts on the two packages are not that much different (144 for
the R3000 vs. 180 for the MC88100) and neither are the power/ground
pin counts (30 for the R3000 vs. 36 for the MC88100).
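
The effect of mixing the two address streams can be illustrated by
counting bit transitions.  The Python sketch below is a toy model, not
a trace from either processor: both streams are simple sequential word
addresses, and the base addresses 0x1000 and 0x80002000 are made up.
Even with two streams that are well correlated on their own, the
interleaved bus toggles more pins per transfer, because every transfer
also flips the bits in which the two regions differ.

```python
def toggles(addresses):
    """Total number of bits that change between consecutive bus values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(addresses, addresses[1:]))


def interleave(xs, ys):
    """Alternate values from two equal-length streams, as on a multiplexed bus."""
    mixed = []
    for x, y in zip(xs, ys):
        mixed.extend((x, y))
    return mixed


# Hypothetical streams: 64 sequential word addresses in two distant regions.
INSTR = [0x1000 + 4 * i for i in range(64)]
DATA = [0x80002000 + 4 * i for i in range(64)]
MIXED = interleave(INSTR, DATA)


def toggles_per_transfer(addresses):
    """Average number of toggling bits per bus transition."""
    return toggles(addresses) / (len(addresses) - 1)


if __name__ == "__main__":
    print("instruction port:", toggles_per_transfer(INSTR))
    print("data port       :", toggles_per_transfer(DATA))
    print("multiplexed bus :", toggles_per_transfer(MIXED))
```

In the power formula, the averages printed here play the role of N.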

So why multiplex the memory bus?  Your pin count isn't that much lower,
your power dissipation suffers greatly and you tend to create
difficulties in interfacing to your memory system.

*  t     = time of output hold from address change
    OHA

** t     = time of output enable high to high Z (impedance)
    HZOE

       /\        /\        Mitch Alsup
      //\\      //\\       Manager of Architecture for the M88000
     ///\\\    ///\\\      
    //    \\  //    \\     Remember: SPARC spelled backwards is .....!
   /        \/        \
  /                    \