Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!bcm!dimacs.rutgers.edu!mips!lloyd!cprice
From: cprice@mips.COM (Charlie Price)
Newsgroups: comp.arch
Subject: Re: R4000
Message-ID: <45792@mips.mips.COM>
Date: 11 Feb 91 21:52:07 GMT
References: <45448@mips.mips.COM> <1991Feb1.223326.18683@watdragon.waterloo.edu> <45525@mips.mips.COM> <obgVpm200VpeQ4VmIa@andrew.cmu.edu>
Sender: news@mips.COM
Reply-To: cprice@mips.COM (Charlie Price)
Organization: MIPS Computer Systems, Inc
Lines: 76

In article <obgVpm200VpeQ4VmIa@andrew.cmu.edu> mh2f+@andrew.cmu.edu (Mark Hahn) writes:
>isn't MIPS's "superpiplining" just the common trick 
>of sticking in a clock doubler?

Superscalar is pretty easy to define, but
what *does* superpipelining really mean?

At least one definition is that it is an implementation in which
extra stages are added to a "normal" pipeline simply to decrease
the clock interval and increase the issue rate.
The R4000 qualifies by this measure.

Another view is that a regular pipeline issues one instruction
per I-cache access latency period.
A superpipeline issues two or more instructions during the
cache access latency.
The R4000 also qualifies by this measure.

One superpipeline "feature" that the R4000 does NOT have
is a multi-stage ALU.
The designers squeezed very hard to get the ALU into one clock.
This is a "good thing" and an important detail of the design.
It makes it possible for the result of an ALU operation
to be available (by bypassing) to the ALU stage of the following operation.
This means that the R4000 has no issue restrictions;
this instruction sequence can be issued in the same external cycle:
	sub	r1 from r2, result in r3
	or	r3 with r4, result in r5
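
A rough way to see why the one-clock ALU matters: the sketch below
counts the bubble cycles a dependent ALU op would need when issued
right behind its producer. It is only an illustration, not a model
of the real chip; the function name is hypothetical and it assumes
full bypassing from the last ALU stage back to the first.

```python
def dependent_issue_delay(alu_stages: int) -> int:
    """Bubble cycles between a producer and a dependent ALU op issued back to back.

    Assumption: the result is bypassed from the last ALU stage straight
    to the first ALU stage of the following instruction.
    """
    if alu_stages < 1:
        raise ValueError("an ALU needs at least one stage")
    producer_enters_alu = 0
    # Last clock the producer spends inside the ALU.
    result_ready_in = producer_enters_alu + alu_stages - 1
    # The consumer can start its ALU work in the clock after that.
    earliest_consumer_entry = result_ready_in + 1
    # With no stall the consumer would enter the ALU one clock behind.
    ideal_consumer_entry = producer_enters_alu + 1
    return earliest_consumer_entry - ideal_consumer_entry


one_clock_alu = dependent_issue_delay(1)   # single-stage ALU, as described above
two_stage_alu = dependent_issue_delay(2)   # hypothetical multi-stage ALU
```

Under these assumptions the single-stage ALU needs no bubble for the
sub/or pair, while a two-stage ALU would need one.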


Pipeline details for the curious ("|" denotes parallel operation):

The R3000 pipeline has 5 stages:

IF	Instruction fetch from I-cache
RF	Register Fetch | instruction decode
ALU	ALU op or load/store address computation
MEM	D-cache access
WB	WriteBack results to register file

The cache access time is one clock period and
an instruction is issued in each clock period.
This is an incomplete description, and parts of the processor are
used twice per cycle in a first-half, second-half staggered fashion,
but note that the ALU occupies a whole clock.
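
For readers who like to see the overlap, here is a small sketch that
lays out which stage each instruction occupies in each clock. It
assumes one issue per clock and no stalls, and it ignores the
half-cycle staggering mentioned above; the helper names are made up
for illustration.

```python
R3000_STAGES = ["IF", "RF", "ALU", "MEM", "WB"]


def pipeline_chart(stages, n_instructions):
    """Return one row per instruction; each row has one cell per clock.

    Assumption: one instruction issues per clock and nothing stalls.
    """
    total_clocks = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = ["."] * total_clocks          # "." marks an idle clock
        for s, name in enumerate(stages):
            row[i + s] = name               # instruction i is one clock behind i-1
        rows.append(row)
    return rows


def format_chart(rows):
    """Render the chart as text, one instruction per line."""
    return "\n".join(
        " ".join(f"{cell:<3}" for cell in row).rstrip() for row in rows
    )


print(format_chart(pipeline_chart(R3000_STAGES, 3)))
```

Each row is shifted one clock to the right of the one above it, so
in the steady state every stage is busy with a different instruction.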

The R4000 has an 8-stage pipeline that takes 4 EXTERNAL clocks:

IF	I-fetch, First cycle  | instr address translation
IS	I-fetch, Second cycle | instr address translation

RF	Register Fetch |  instruction decode | tag check of I-cache entry
EX	ALU or load/store address computation

DF	D-cache access, First cycle  | data address translation
DS	D-cache access, Second cycle | data address translation

TC	Tag Check of D-cache entry
WB	WriteBack to register file

Two instructions are issued per EXTERNAL clock;
this is the same period as the on-chip cache latency.
To do this, an internal clock runs at double the external clock and
one instruction is issued per internal clock.
This requires that cache access be pipelined.
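
The arithmetic behind those numbers can be checked with a few lines.
This is just a sketch of the stall-free case using the stage list
above; the constant and function names are hypothetical.

```python
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]
INTERNAL_PER_EXTERNAL = 2    # internal clock runs at double the external clock
CACHE_LATENCY_INTERNAL = 2   # IF+IS for the I-cache, DF+DS for the D-cache
ISSUES_PER_INTERNAL = 1      # one instruction issued per internal clock


def external_clocks(internal_clocks):
    """Convert a count of internal clocks to external clocks."""
    return internal_clocks / INTERNAL_PER_EXTERNAL


def internal_clocks_to_finish(n_instructions):
    """Internal clocks until the last of n instructions leaves WB.

    Assumption: no stalls, one issue per internal clock.
    """
    return len(R4000_STAGES) + n_instructions - 1


# Eight internal stages fit in four external clocks.
pipeline_depth_external = external_clocks(len(R4000_STAGES))

# Instructions issued during one cache access latency: the second
# definition of superpipelining given earlier asks for two or more.
issues_per_cache_latency = CACHE_LATENCY_INTERNAL * ISSUES_PER_INTERNAL

# Instructions issued per external clock.
issues_per_external = ISSUES_PER_INTERNAL * INTERNAL_PER_EXTERNAL
```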

This is much like the 3K pipeline except that the cache access
was chopped into two stages, and the D-cache tag check
needed a separate stage before writeback.
The RegisterFetch, EXecute, and WriteBack stages do roughly the same
work as before, just faster.
Squeezing the ALU into one internal clock required a faster adder.

-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650