Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!rice!uw-beaver!milton!dali.cs.montana.edu!uakari.primate.wisc.edu!zaphod.mps.ohio-state.edu!rpi!uwm.edu!lll-winken!sun-barr!newstop!sun!amdcad!mozart.amd.com!nucleus!davec
From: davec@nucleus.amd.com (Dave Christie)
Newsgroups: comp.arch
Subject: Re: R4000 - compatibilty questions
Message-ID: <1991Feb12.210800.9750@mozart.amd.com>
Date: 12 Feb 91 21:08:00 GMT
References: <49041@apple.Apple.COM>
Sender: usenet@mozart.amd.com (Usenet News)
Reply-To: davec@nucleus.amd.com (Dave Christie)
Organization: Advanced Micro Devices, Austin, TX
Lines: 96

Sorry, this has gotten rather long...

In article <49041@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <45792@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
>>Superscalar is pretty easy to define, but
>>what *does* superpipelining really mean?
>
> (details of R4000 pipe... Thanks!)
>>
>>Two instructions are issued per EXTERNAL clock,
>>this is the same period as the on-chip cache latency.
>>To do this, an internal clock runs at double the external clock and
>>one instruction is issued per internal clock so
>>This requires that cache access is pipelined.
>
>This actually runs contrary to MY definition of superpipelined, which
>is probably like MY definition of RISC - there's no such animal as THE
>definition (for the curious, my definition requires all functional
>units be multiple cycle- the R4000 behaves like a normal pipeline with
>a multiple delay slot cache access [that's pipelined]).

I was just about to post in a similar vein. When Jouppi coined the
terms superscalar and superpipelined (at least his paper in ACM CAN a
couple of years ago was the first I saw of the terms), he was comparing
the two techniques in an apples/apples sort of way, stating that in a
superpipelined machine even the most basic operations have more than
one pipestage of latency, and that both techniques are subject to
similar dependency stalls.
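
A minimal sketch of that apples/apples comparison, under heavily
simplified assumptions that are not taken from Jouppi's paper: in-order
issue, every operation has the same latency, and there are no cache or
branch effects. The function name run_time and its parameters are
hypothetical. A degree-2 superscalar machine issues two instructions per
base cycle with one cycle of latency; a degree-2 superpipelined machine
issues one instruction per half cycle with two half cycles of latency.

```python
def run_time(deps, issue_width, latency, ticks_per_cycle):
    """Return execution time in base cycles for an in-order machine.

    deps[i] lists the earlier instructions that instruction i depends on.
    issue_width: instructions issued per tick (assumed model parameter).
    latency: ticks until a result is available.
    ticks_per_cycle: ticks (minor cycles) per base machine cycle.
    """
    ready = []            # tick at which each instruction's result is available
    tick = 0              # current issue tick
    issued_this_tick = 0
    for d in deps:
        # Operands must be ready before the instruction can issue.
        earliest = max((ready[j] for j in d), default=0)
        if earliest > tick:            # dependency stall
            tick = earliest
            issued_this_tick = 0
        if issued_this_tick == issue_width:   # issue slots used up
            tick += 1
            issued_this_tick = 0
        ready.append(tick + latency)
        issued_this_tick += 1
    return max(ready, default=0) / ticks_per_cycle


independent = [[] for _ in range(8)]                  # no dependencies
chain = [[]] + [[i - 1] for i in range(1, 8)]         # each depends on the previous

for name, prog in (("independent", independent), ("chain", chain)):
    base = run_time(prog, issue_width=1, latency=1, ticks_per_cycle=1)
    superscalar = run_time(prog, issue_width=2, latency=1, ticks_per_cycle=1)
    superpipelined = run_time(prog, issue_width=1, latency=2, ticks_per_cycle=2)
    print(name, base, superscalar, superpipelined)
```

In this toy model both degree-2 machines roughly halve the time for
independent instructions, and neither gains anything over the base
machine on a fully dependent chain, which is the "similar dependency
stalls" point.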

The R4000 is certainly aggressively pipelined, but not fully
superpipelined. It is only somewhat more so than the 88K or 29050, for
instance (disregarding the frequency) - one could say they're all
superpipelined w.r.t. floating point operations. Load operations in the
R3000 could be called superpipelined - instructions are issued at twice
the rate implied by the latency of a load. The superpipelined criteria
mentioned by Mr. Price in another posting are rather weak: issue rate
2x icache access? What if I decided to put in such a large icache that
it increased my cycle time to almost 2x the capabilities of my other
(R3K-like) stages, then merely made the cache double width and allowed
two cycles to access it so I could still issue instructions at the
faster rate - is it suddenly superpipelined? (If you say yes, you must
be in marketing ;-).

I'm certainly not trying to pick on the R4000 - it's a reasonably
impressive design, and timely execution on the rest of the project will
make it more so. My complaint is more with the term "superpipelined",
which is quickly becoming as meaningful as "RISC" and "MIPS". It's been
well known since the early 60's that there are various degrees of
aggressive pipelining; Jouppi refers to the CDC 6600 and 7600 as
superpipelined. These varying degrees of aggressiveness are what will
make the term so slippery (hence a good marketing term. Can you tell
I'm not in marketing? :-).

What the R4000 has done for me at least is point out that
superpipelining as defined and compared with superscalar by Jouppi is
for the most part an academic exercise, at least with respect to the
reality of the technologies we have to work with today: on-chip cache
sizes grow faster than their access times, but by the same token
integer alu sizes (& latencies) shrink (alright, assuming constant word
size...)
so what might have been fairly balanced alu/cache access times in an
older technology start to get very skewed - it makes perfect sense to
split your cache access into two stages, but not your alu. (The same
applies to superscalar designs.) One could go with a smaller, faster
pipelined cache, split the alu and then maybe run at an even higher
rate, BUT maybe not double, and it would be more sensitive to operand
dependencies. Moreover, pairing smaller caches with higher execution
rates is rather pointless. For these reasons, I don't think one could
do an optimal fully-superpipelined implementation. (I don't know what
the relative latencies of the alu and combined/split cache stages
turned out to be for the R4000, but I'm tempted to speculate that one
of the cache stages turned out to be still slower than a 32-bit alu, so
going to 64 bits, while maybe even exceeding that cache stage by a bit,
wasn't too painful, and therefore downright worthwhile. Just pure
speculation, of course.)

>I would expect a multiple cycle hit for taken branches as well, without
>some kind of branch acceleration technique like a branch target cache.

Yep. An extra cycle. At least they can still eat one cycle with the
delay slot. This extra branch recovery cycle and the extra load cycle
will cause non-uniform speedups among various integer codes that
aren't re-scheduled.

>I'm also curious about the compatibility provisions. Is there a
>'mode'? Are instructions 64 bits long now, or a mixture? Were just a few
>new instructions added (like load/store double, shift double) and the
>semantics of existing ones change (load becomes load signed/unsigned,
>shifts become shift single, etc)? If there is a mode, does it just change
>where in the word a condition is taken from? What else?

Yeah, good questions. What's the cache impact of pointers taking up
double the space? (I know this doesn't apply to 32-bit "mode", but I
can't imagine that there'll be two sizes of pointers in 64-bit "mode".)
This plus the extra load cycle won't look good for things like
searching linked lists, but then I have yet to see any really
aggressive high performance design provide balanced speedups across the
spectrum. In any case, one can be sure Mipsco took all that into
account for their target market.

------------------------------------------
Dave Christie       My opinions only.