Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!samsung!brutus.cs.uiuc.edu!apple!oliveb!mipos3!omepd!mipon2.intel.com!mcg
From: mcg@mipon2.intel.com (Steven McGeady)
Newsgroups: comp.arch
Subject: Re: 55 MIPS & 66 MIPS [really: i960, i860, etc]
Message-ID: <5276@omepd.UUCP>
Date: 28 Nov 89 02:39:06 GMT
References: <3044@brazos.Rice.edu> <1358@bnr-rsc.UUCP> <31329@winchester.mips.COM> <22303@gryphon.COM> <3024@brazos.Rice.edu> <31659@winchester.mips.COM>
Sender: news@omepd.UUCP
Reply-To: mcg@mipon2.intel.com (Steven McGeady)
Lines: 50

In article <3044@brazos.Rice.edu>, preston@titan.rice.edu (Preston Briggs) writes:
> >and did not confuse the two chips, but just to make sure,
> >i860s and i960s are completely different chips.
>
> Right.  I was just trying to cast aspersions on data that suggest we're
> going to see an average of 2 instructions/cycle sometime this decade
> (wow, big claim).  It might be a couple of years.
> The problem is lack of compilers, not the chips.

Cast all the aspersions that you wish, but the 960CA can and does
execute useful hand-written assembly-language code at a sustained
rate of two instructions per clock.

How useful is this to the average UNIX user?  Not at all (at least
right at the moment).  The 960CA is an embedded controller.  However,
one can code a matrix multiply, Bresenham, Bezier, or other function
to run at this sustained rate.

To run *all* code at a sustained rate of 2 instructions/clock, a
normal (that is, non-VLIW) architecture would have to be capable of
fetching and decoding far more than the 3 instructions per clock that
the 960CA attempts.

What is useful is that the 960CA can overlap the *dispatch* as well as
the execution of the instruction stream, so that the sequence:

        addi    g0,g1,g2
        ldq     (g0),g4         # load 4 words into g4..g7
        addi    g2,g3,g0

is executed in 2 clocks (though the load instruction will not return
until some time later, depending on the latency of memory).
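
As a rough illustration of the overlap described here, the following is a
toy issue model in Python.  It is a sketch under stated assumptions, not a
description of the actual 960CA pipeline: the two unit names ("reg" and
"mem"), the rule of one instruction per unit per clock, the in-order issue,
and the 4-clock load latency are all hypothetical choices made for the
example.  Operands are read as src1,src2,dst, and the ldq is taken to write
g4..g7 as the comment in the sequence says.

```python
from dataclasses import dataclass

# Hypothetical parameter for this toy model; not a 960CA specification.
LOAD_LATENCY = 4   # clocks from issuing a load until its registers are valid


@dataclass
class Instr:
    text: str      # assembly text, for display only
    unit: str      # "reg" for register-to-register ops, "mem" for loads
    srcs: tuple    # registers read
    dsts: tuple    # registers written


def schedule(program, load_latency=LOAD_LATENCY):
    """Return the clock (counting from 1) in which each instruction issues.

    Toy rules: issue is in order, each unit accepts one instruction per
    clock, and an instruction waits until all of its sources are ready.
    """
    ready_at = {}      # register -> first clock in which it may be read
    busy = set()       # units already used in the current clock
    clock = 1
    issue_clocks = []
    for ins in program:
        operands_ready = max([ready_at.get(r, 1) for r in ins.srcs], default=1)
        if ins.unit in busy:
            # The unit is taken this clock, so move to the next one.
            clock += 1
            busy = set()
        if operands_ready > clock:
            # A source is still pending: stall until it is ready.
            clock = operands_ready
            busy = set()
        busy.add(ins.unit)
        issue_clocks.append(clock)
        latency = load_latency if ins.unit == "mem" else 1
        for r in ins.dsts:
            ready_at[r] = clock + latency
    return issue_clocks


# The three-instruction sequence from the article.
SEQ = [
    Instr("addi g0,g1,g2", "reg", ("g0", "g1"), ("g2",)),
    Instr("ldq (g0),g4", "mem", ("g0",), ("g4", "g5", "g6", "g7")),
    Instr("addi g2,g3,g0", "reg", ("g2", "g3"), ("g0",)),
]

# A hypothetical fourth instruction that reads one of the loaded registers.
CONSUMER = Instr("addi g4,g0,g1", "reg", ("g4", "g0"), ("g1",))

for ins, clk in zip(SEQ + [CONSUMER], schedule(SEQ + [CONSUMER])):
    print(f"clock {clk}: {ins.text}")
```

Under these assumptions the first addi and the ldq issue together in
clock 1 and the second addi issues in clock 2, while the hypothetical
consumer of g4 waits until clock 5 for the load to complete.  Placing
independent instructions ahead of the consumer fills the clocks it would
otherwise spend stalled.
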
The instruction stream will continue to be executed until g4, g5, g6,
or g7 is used as a source operand in another instruction.  With good
scheduling technology, this will be several more instructions.

So the 960CA introduces a new technique that dramatically increases the
potential amount of parallelism available, without resorting to VLIW
schemes that chew up memory space and bandwidth.  This is what's new
and important about the CA.

Attached to the end of this article is a real program that runs at 66
native MIPS on the 960CA.

> On the other hand, Multiflow has probably been doing it for years.

Only if you assume a rich floating-point mix in their calculations.  I
doubt very much whether the kernel runs at 2 instructions/clock.  That
is our goal (albeit not yet realized).

S. McGeady
Intel Corp.