Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!bcm!dimacs.rutgers.edu!mips!lloyd!cprice
From: cprice@mips.COM (Charlie Price)
Newsgroups: comp.arch
Subject: Re: R4000
Message-ID: <45792@mips.mips.COM>
Date: 11 Feb 91 21:52:07 GMT
References: <45448@mips.mips.COM> <1991Feb1.223326.18683@watdragon.waterloo.edu> <45525@mips.mips.COM>
Sender: news@mips.COM
Reply-To: cprice@mips.COM (Charlie Price)
Organization: MIPS Computer Systems, Inc
Lines: 76

In article mh2f+@andrew.cmu.edu (Mark Hahn) writes:
>isn't MIPS's "superpiplining" just the common trick
>of sticking in a clock doubler?

Superscalar is pretty easy to define, but what *does* superpipelining
really mean?

At least one definition is that it is an implementation in which extra
stages are added to a "normal" pipeline simply to decrease the clock
interval and increase the issue rate.
The R4000 qualifies by this measure.

Another view is that a regular pipeline issues one instruction per
I-cache access latency period. A superpipeline issues two or more
instructions during the cache access latency.
The R4000 also qualifies by this measure.

One superpipeline "feature" that the R4000 does NOT have is a
multi-stage ALU. The designers squeezed very hard to get the ALU
into one clock. This is a "good thing" and an important detail of
the design. It makes it possible for the result of an ALU operation
to be available (by bypassing) to the ALU stage of the following
operation.
This means that the R4000 has no issue restrictions; this instruction
sequence can be issued in the same external cycle:

    sub r1 from r2, result in r3
    or  r3 with r4, result in r5

Pipeline details for the curious ("|" denotes parallel operation):

The R3000 pipeline has 5 stages:

    IF   Instruction fetch from I-cache
    RF   Register Fetch | instruction decode
    ALU  ALU op or load/store address computation
    MEM  D-cache access
    WB   WriteBack results to register file

The cache access time is one clock period and an instruction
is issued in each clock period.
This is an incomplete description, and parts of the processor are
used twice per cycle in a first-half, second-half staggered fashion,
but note that the ALU occupies a whole clock.

The R4000 has an 8-stage pipeline that takes 4 EXTERNAL clocks:

    IF   I-fetch, First cycle | instr address translation
    IS   I-fetch, Second cycle | instr address translation
    RF   Register Fetch | instruction decode | tag check of I-cache entry
    EX   ALU or load/store address computation
    DF   D-cache access, First cycle | data address translation
    DS   D-cache access, Second cycle | data address translation
    TC   Tag Check of D-cache entry
    WB   WriteBack to register file

Two instructions are issued per EXTERNAL clock, which is the same
period as the on-chip cache latency.
To do this, an internal clock runs at double the external clock
and one instruction is issued per internal clock.
This requires that cache access be pipelined.

This is much like the R3000 pipeline except that the cache access was
chopped into two stages, and the D-cache tag check needed a separate
stage before writeback.
The RegisterFetch, EXecute, and WriteBack stages do roughly the
same work as before, just faster.
Squeezing the ALU into one clock required a faster adder.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650
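
For readers who want to play with the timing described in the post, here
is a rough Python sketch of the two pipelines. It is an idealized model
and not anything from MIPS: it assumes one instruction enters the pipe per
internal clock, no stalls, and full bypassing from the end of the ALU
stage to the start of the next instruction's ALU stage. The function names
(stage_start_times, pipeline_latency, bypass_stall) are made up for
illustration. Times are measured in EXTERNAL clocks.

```python
from fractions import Fraction

# Stage names as listed in the post.
R3000_STAGES = ["IF", "RF", "ALU", "MEM", "WB"]
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def stage_start_times(stages, n_instructions, internal_per_external):
    """Start time (in external clocks) of every stage for each instruction.

    Assumes an ideal pipe: instruction i enters the first stage at
    internal clock i and advances one stage per internal clock.
    """
    tick = Fraction(1, internal_per_external)  # one internal clock
    return [
        {stage: (i + k) * tick for k, stage in enumerate(stages)}
        for i in range(n_instructions)
    ]


def pipeline_latency(stages, internal_per_external):
    """External clocks for one instruction to pass through all stages."""
    return Fraction(len(stages), internal_per_external)


def bypass_stall(stages, alu_stage, internal_per_external):
    """Stall (external clocks) for an ALU op that depends on the ALU op
    issued immediately before it, assuming the result is bypassed from
    the end of the producer's ALU stage."""
    tick = Fraction(1, internal_per_external)
    times = stage_start_times(stages, 2, internal_per_external)
    result_ready = times[0][alu_stage] + tick   # single-stage ALU
    consumer_needs = times[1][alu_stage]
    return max(Fraction(0), result_ready - consumer_needs)


if __name__ == "__main__":
    for name, stages, ratio, alu in [
        ("R3000", R3000_STAGES, 1, "ALU"),
        ("R4000", R4000_STAGES, 2, "EX"),
    ]:
        print(name,
              "latency:", pipeline_latency(stages, ratio),
              "back-to-back ALU stall:", bypass_stall(stages, alu, ratio))
```

In this model the dependent sub/or pair from the post enters the R4000
pipe at times 0 and 1/2, i.e. within the same external clock, and the
second instruction reaches EX exactly when the first one's result becomes
available. A two-stage ALU would make result_ready later than
consumer_needs, which is the issue restriction the single-clock ALU
avoids.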