Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!vsi1!wyse!mips!earl@wright.mips.com
From: earl@wright.mips.com (Earl Killian)
Newsgroups: comp.arch
Subject: Re: [HS]W interlocks (was: Fujitsu SPARC Interlocks)
Message-ID: <13273@wright.mips.COM>
Date: 14 Feb 89 23:17:20 GMT
References: <28200269@mcdurb> <28200273@mcdurb> <3007@ardent.UUCP> <14619@cup.portal.com> <24435@amdcad.AMD.COM>
Sender: earl@mips.COM
Reply-To: earl@wright.mips.com (Earl Killian)
Organization: MIPS Computer Systems, Sunnyvale CA
Lines: 36
In-reply-to: tim@crackle.amd.com (Tim Olson)

In article <24435@amdcad.AMD.COM>, tim@crackle (Tim Olson) writes:
>If an on-chip I-cache can be built that will supply an instruction in a
>single-cycle (which it *has* to, in order to run at 1 inst/cycle), why
>can't a D-cache with the same characteristics exist?  If there is a load
>in the execute stage, then TLB translation can occur in parallel with
>D-cache lookup, resulting in a value that can be forwarded to the ALU
>for use in the very next instruction.
>
>A single delay slot, with good scheduling, still causes about a 5% to 6%
>pipeline stall (or equivalent nop execution) which could be reduced with
>a fast on-chip D-cache.

You can easily build a data cache with the same latency as your
instruction cache.  But you need to provide an address to that data
cache, and it is the latency of the address formation + access that
creates the 1-cycle minimum delay that John Hennessy referred to.

Your statement is really only true in the context of the 29000 and
similar machines, which have no address add stage (addresses are simply
the contents of a register), and not for the MIPS instruction set, where
the address is formed from a base register plus a signed 16-bit
displacement.  This "feature" of the 29000 is unusual, and I think it is
a mistake.  You certainly can't use the fact that it is possible to
implement a delayless 29000 load to justify putting load interlocks into
the MIPS architecture!

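To make the address-formation point concrete, here is a toy timing model.
It is only a sketch under stated assumptions, not a description of any
real MIPS or 29000 implementation: a single-issue, in-order pipeline in
which ALU results are forwarded with no delay, and a load result becomes
usable `load_delay` instructions late.  `load_delay=1` stands in for a
base+displacement load (address add, then cache access), and
`load_delay=0` for a register-only address.  The names `Instr`,
`wasted_cycles`, `unscheduled` and `scheduled` are all hypothetical.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Instr(NamedTuple):
    op: str                 # "load" or "alu" (toy model, only two kinds)
    dest: Optional[str]     # register written, if any
    srcs: Tuple[str, ...]   # registers read


def wasted_cycles(program: Sequence[Instr], load_delay: int) -> int:
    """Count cycles lost to load-use dependences.

    Assumption: a lost cycle costs the same whether it is a hardware
    interlock stall or a software nop in the delay slot.
    """
    ready = {}   # register -> first cycle at which a consumer may issue
    cycle = 0
    wasted = 0
    for ins in program:
        # Earliest cycle at which all source registers are available.
        need = max((ready.get(r, 0) for r in ins.srcs), default=0)
        if need > cycle:
            wasted += need - cycle
            cycle = need
        if ins.dest is not None:
            extra = load_delay if ins.op == "load" else 0
            ready[ins.dest] = cycle + 1 + extra
        cycle += 1
    return wasted


# The load result is used by the very next instruction.
unscheduled = [
    Instr("load", "r1", ("r2",)),
    Instr("alu", "r3", ("r1", "r4")),
    Instr("alu", "r6", ("r4", "r4")),
]

# Same work, with the independent instruction moved into the delay slot.
scheduled = [
    Instr("load", "r1", ("r2",)),
    Instr("alu", "r6", ("r4", "r4")),
    Instr("alu", "r3", ("r1", "r4")),
]
```

In this model, `wasted_cycles(unscheduled, 1)` is 1, while both
`wasted_cycles(scheduled, 1)` and `wasted_cycles(unscheduled, 0)` are 0.
The lost cycle comes from the extra address-formation latency, and it
vanishes when either the address add is absent or the slot is filled.
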
I think Slater's question should have been "Will there ever be MIPS
instruction set implementations that have no delay slot?" instead of
"Will there be pipelined uPs that have no delay slot?" because the
higher-level question was "What does MIPS lose by having a load delay
slot instead of a load interlock?".  I agree with Hennessy that the load
delay slot will never cost MIPSco performance, except for a small
increase in the I-cache miss rate.
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086