Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!decwrl!amdcad!crackle!tim
From: tim@crackle.amd.com (Tim Olson)
Newsgroups: comp.arch
Subject: Re: [HS]W interlocks (was: Fujitsu SPARC Interlocks)
Message-ID: <24435@amdcad.AMD.COM>
Date: 14 Feb 89 05:27:48 GMT
References: <28200269@mcdurb> <28200273@mcdurb> <3007@ardent.UUCP> <14619@cup.portal.com>
Sender: news@amdcad.AMD.COM
Reply-To: tim@amd.com (Tim Olson)
Organization: Advanced Micro Devices, Inc. Sunnyvale CA
Lines: 30

In article <14619@cup.portal.com> mslater@cup.portal.com (Michael Z Slater) writes:
| The MIPS processors (R2000 and R3000) are the only commercial uPs I'm aware
| of that are not fully interlocked; are there others? (Not counting delayed
| branches, of course, which everyone does.)
|
| The MIPS architecture definition has one load delay slot. Processors that
| have longer load latency will simply require interlocks. John Hennessy
| contends that it will never make sense to build a processor with no load
| delay slot. As I understand it, his argument is that even with on-chip cache,
| the register file will be faster to access than the cache, and if there is no
| delay slot, then the machine isn't running as fast as it could and would be
| better off with a faster clock and a load delay slot.
|
| Anyone disagree? Will there be pipelined uPs that have no delay slot?

If an on-chip I-cache can be built that will supply an instruction in a
single cycle (which it *has* to, in order to run at 1 inst/cycle), why
can't a D-cache with the same characteristics exist? If there is a load
in the execute stage, then TLB translation can occur in parallel with
D-cache lookup, resulting in a value that can be forwarded to the ALU
for use in the very next instruction.

A single delay slot, with good scheduling, still causes about a 5% to 6%
pipeline stall (or equivalent nop execution) which could be reduced with
a fast on-chip D-cache.
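
To make the arithmetic behind a figure like 5% to 6% concrete, here is a
rough sketch in Python. Everything in it is an assumption for illustration:
the function names, the toy instruction format, and especially the load
frequency (20%) and unfilled-slot fraction (30%), which were picked only
because they land in that range and are not measurements from any real
machine or compiler. The hazard counter treats each load in isolation and
ignores interactions between back-to-back stalls.

```python
def stall_fraction(load_freq, unfilled_frac, delay_slots=1):
    """Fraction of all cycles lost to load delays (stalls or nops).

    load_freq     -- assumed fraction of instructions that are loads
    unfilled_frac -- assumed fraction of load delay slots the scheduler
                     could not fill with useful work
    delay_slots   -- load delay in cycles (0 models a D-cache fast enough
                     to forward to the very next instruction)
    """
    wasted = load_freq * unfilled_frac * delay_slots  # wasted cycles per useful instruction
    return wasted / (1.0 + wasted)


def load_use_stalls(program, delay_slots):
    """Count wasted cycles in a list of (op, dest, sources) tuples.

    Simplified model: a load's result is usable delay_slots + 1
    instructions later; an earlier use costs the missing cycles.
    """
    stalls = 0
    for i, (op, dest, _sources) in enumerate(program):
        if op != "load":
            continue
        # Look at the instructions sitting in this load's delay slots.
        for k in range(1, delay_slots + 1):
            if i + k < len(program) and dest in program[i + k][2]:
                stalls += delay_slots - k + 1
                break
    return stalls


# Hypothetical five-instruction sequence: the first load is used
# immediately, the second has an independent instruction in its slot.
PROGRAM = [
    ("load", "r1", ("r2",)),
    ("add", "r3", ("r1", "r4")),
    ("load", "r5", ("r6",)),
    ("sub", "r7", ("r8", "r9")),
    ("add", "r10", ("r5", "r7")),
]

if __name__ == "__main__":
    print("stalls with one delay slot:", load_use_stalls(PROGRAM, 1))
    print("stalls with no delay slot: ", load_use_stalls(PROGRAM, 0))
    print("estimated stall fraction:  ", round(stall_fraction(0.20, 0.30), 4))
```

With delay_slots set to 0 both functions report no lost cycles, which is
the case argued for above: a D-cache that delivers in one cycle removes
the penalty rather than relying on the scheduler to hide it.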

	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)