Path: utzoo!attcan!uunet!husc6!cs.utexas.edu!oakhill!mpaton
From: mpaton@oakhill.UUCP (Michael Paton)
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Keywords: RISC pipeline memory power load latency
Message-ID: <1362@oakhill.UUCP>
Date: 1 Jul 88 20:57:15 GMT
Organization: Motorola Inc. Austin, Tx
Lines: 120

In article <2465@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>> b) If so, what is the reason for the second latency slot?

The reason for the additional delay slot is twofold. First, the M88100
has allotted .2 clocks to the data path, .7 clocks to a 32 bit adder,
and more than .25 of a clock for driving a signal off chip with .1 clock
setup margins. Since we desired all signals to be synchronous, we
rounded this 1.25 clocks up to 1.5 clocks. These 1.5 clocks are spent
performing the addition and delivering the address to the M88200 cache.
The cache has 1 clock from valid address to valid data (1.2 valid
address to hit). Then the processor uses the remaining .5 clock to
perform byte extraction and sign extension and to deliver the result to
the result bus. Therefore the memory pipeline is 3 clocks in length.

The MIPS Co. designers sliced this into 2 clocks and integrated it into
their basic (only) pipeline. This means that when the data cache misses
or is not available (e.g. bus snooping), their entire processor skips a
beat. Since the M88100 data pipeline is decoupled from the instruction
pipeline, the beat is not skipped unless a data dependency occurs. This
is more beneficial in the case of outbound store traffic on
write-through pages than on the nominal load traffic.

In two clocks they must perform a 32+16 addition and the mmu function
before driving the pins. Moreover, the multiplexed bus requires that the
addresses be latched (externally on both the R2000 and the R3000).
Memory must still be accessed and delivered to the result bus/register
file/forwarding unit.
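As a rough check of the M88100 clock budget quoted above, the figures
can be tallied in a few lines of Python. This is only a sketch: the
names are made up for illustration, the "more than .25" off-chip figure
is taken as exactly .25, and rounding up to the next half clock is an
assumption about what "rounded this 1.25 clocks up to 1.5" means.

```python
from fractions import Fraction as F
import math

# Fractions of a clock quoted in the post for the address-generation leg.
# Names are illustrative; "off-chip drive" is quoted as "more than .25",
# so .25 is used here as a lower bound (assumption).
address_leg = {
    "data path": F(2, 10),
    "32-bit adder": F(7, 10),
    "off-chip drive": F(25, 100),
    "setup margin": F(1, 10),
}


def round_up_to_half_clock(clocks):
    """Round up to the next multiple of half a clock (assumed granularity)."""
    return F(math.ceil(clocks * 2), 2)


raw_address_clocks = sum(address_leg.values())
address_clocks = round_up_to_half_clock(raw_address_clocks)

cache_clocks = F(1)      # valid address to valid data in the M88200
align_clocks = F(1, 2)   # byte extraction, sign extension, result bus

total_clocks = address_clocks + cache_clocks + align_clocks

print("raw address leg :", float(raw_address_clocks), "clocks")
print("rounded         :", float(address_clocks), "clocks")
print("memory pipeline :", float(total_clocks), "clocks")
```

The total is the 3-clock memory pipeline described above, against the
2 clocks into which the MIPS design fits comparable work.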
Therefore, the time must come from somewhere, and it seems to have come
out of the memory SRAM speed in the R3000 and the circuit speeds and
margins. According to the data sheet (using a F373 transparent latch),
the MIPS design requires 20ns SRAMs. Our design (assuming a glue part
cache implementation: F373 latch and 3ns t-sub-OHA* SRAM with spec sheet
timing) allows 32ns SRAM parts at a 40ns cycle or 42ns SRAM at 50ns. The
MIPS design requires 0ns SRAM at 40 MHz, while we do not hit this brick
wall until 120 MHz.

Since the two SRAM arrays' data pins are connected together and to three
other interfaces, they must be sampled at 2/3 of the phase period by all
interested parties and then rapidly removed from the multiplexed data
bus (the data is only valid for 3.5ns on the data bus). This requires
very strict timing control over this vital feature. In addition, the
Cypress specs I checked had t-sub-HZOE** of 15ns maximum while the R3000
requires this in 8ns. Note that this is not a problem as the other bank
of SRAMs has a minimum output enable time of 9ns while the Cypress parts
(CY7C161 16K*4) did not.

The MIPS processors do not snoop their bus and therefore leave memory
coherence to the write-through mechanism. In multiprocessing
applications, the memory bus can become saturated with a few processors
on the bus (~4?). Write-back caches cause a sufficient reduction in
memory bus traffic to allow twice the number of processing ensembles to
utilize the bus.

Therefore, the answer to the original question is: we were more
conservative in allotting nanoseconds to functions, and, in particular,
we attempted to beat on the SRAM technology less hard. If we are
correct, this should be more scalable in the future (read ECL/GaAs) as
off-chip delays approach .4 cycle. Alternatively, our design costs less
to manufacture in high volume and allows less costly SRAM parts than the
MIPS Co. design.
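The two M88100 SRAM figures quoted earlier (32ns parts at a 40ns cycle,
42ns parts at 50ns) are consistent with a fixed non-SRAM overhead per
cycle. The sketch below assumes that simple linear model, which is an
inference from those two points and not something taken from a data
sheet; the function names are hypothetical.

```python
# (cycle time in ns, allowed SRAM access time in ns), as quoted in the post.
quoted_points = [(40, 32), (50, 42)]


def fixed_overhead_ns(points):
    """Return the per-cycle overhead if every point implies the same one.

    Assumes allowed SRAM time = cycle time - constant overhead
    (a linear model inferred from the quoted figures, not from a spec).
    """
    overheads = {cycle - sram for cycle, sram in points}
    if len(overheads) != 1:
        raise ValueError("points do not fit a constant-overhead model")
    return overheads.pop()


def sram_budget_ns(cycle_ns, overhead_ns):
    """SRAM access time left over at a given cycle time."""
    return cycle_ns - overhead_ns


def brick_wall_mhz(overhead_ns):
    """Clock rate at which the SRAM budget reaches 0ns."""
    return 1000.0 / overhead_ns


overhead = fixed_overhead_ns(quoted_points)
print("overhead per cycle:", overhead, "ns")
print("SRAM budget at 40ns cycle:", sram_budget_ns(40, overhead), "ns")
print("0ns SRAM budget at about", brick_wall_mhz(overhead), "MHz")
```

Under this model the SRAM budget runs out at 125 MHz, in the same
neighborhood as the 120 MHz figure given above.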
This should allow us to push the clock frequency farther than MIPS Co.,
but not at the current mask set. Remember the recent announcement of
33 MHz M68020 and M68030: a little shrink and a lot of tuning can obtain
many MHz. We have just begun to tune these M88k parts.

>>Note that our numbers say that in our machines, it would cost us
>>10-15% in overall performance to go from 1 cycle latency to 2,

Our numbers show that with the decoupling of the pipeline and earliest
load scheduling we suffer 6-8% cycle-to-cycle performance degradation
when comparing the MIPS design to the M88100 design. This is not too bad
when one considers the relaxation of external RAM speed requirements and
allowing the hit logic to run in arrears of the data delivery.

Independent of this current discussion on the length of load pipelines,
one might want to ask the folks at MIPS Co. this question: Why did you
multiplex your memory bus?

Consider factors related to power dissipation. The current M88100
processors are running between .25 and .5 watts @ 20 MHz. If we were to
multiplex the 2 memory ports as did MIPS Co., our worst case power
consumption would be 4 watts. The problem is that the AC power
dissipation is given by:

	2*C*V**2*F*N

where:
	C = load capacitance (F),
	V = voltage swing (V),
	F = frequency (Hz), and
	N = number of pins which make transitions.

(For the MC88100, V = 3.8 volts (TTL logic levels) and F = 20 MHz)

The addresses from the instruction port are very highly correlated
(about 1.4 bits per cycle change). The addresses from the data port are
only partially correlated (less so with better compilers). Mixing these
two streams results in almost uncorrelated address streams and therefore
a bigger N, resulting in more power dissipation.

Notice that the pin counts on the two packages are not that much
different (144 for the R3000 vs. 180 for the MC88100) and neither are
the power/ground pin counts (30 for the R3000 vs. 36 for the MC88100).
So why multiplex the memory bus?
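To see how the AC power expression 2*C*V**2*F*N scales with the number
of switching pins, here is a small Python sketch. V, F and the 1.4
bits-per-cycle figure are from the post; the 50 pF load capacitance and
the 16 switching pins for a mixed address stream are purely hypothetical
placeholders, since the post gives neither.

```python
import math


def ac_power_watts(c_farads, v_volts, f_hertz, n_pins):
    """AC power as given in the post: 2*C*V**2*F*N."""
    return 2 * c_farads * v_volts ** 2 * f_hertz * n_pins


V_SWING = 3.8        # volts, from the post (TTL logic levels)
FREQ = 20e6          # Hz, from the post
C_LOAD = 50e-12      # farads: hypothetical per-pin load, NOT from the post

N_INSTRUCTION = 1.4  # address pins changing per cycle, from the post
N_MIXED = 16         # hypothetical count for a nearly uncorrelated stream

p_instr = ac_power_watts(C_LOAD, V_SWING, FREQ, N_INSTRUCTION)
p_mixed = ac_power_watts(C_LOAD, V_SWING, FREQ, N_MIXED)

print("instruction-port address pins: %.3f W" % p_instr)
print("mixed address stream         : %.3f W" % p_mixed)
print("ratio                        : %.1fx" % (p_mixed / p_instr))
```

Whatever the real capacitance is, the expression is linear in N, so the
ratio between the two cases depends only on how many pins toggle per
cycle.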
Your pin count isn't that much lower, your power dissipation suffers
greatly, and you tend to create difficulties in interfacing to your
memory system.

*  t-sub-OHA  = time of output hold from address change
** t-sub-HZOE = time of output enable high to high Z (impedance)

       /\      /\         Mitch Alsup
      //\\    //\\        Manager of Architecture for the M88000
     ///\\\  ///\\\
    //    \\//    \\      Remember: SPARC spelled backwards is .....!
   /      \/      \