Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!samsung!brutus.cs.uiuc.edu!apple!oliveb!mipos3!omepd!mipon2.intel.com!mcg
From: mcg@mipon2.intel.com (Steven McGeady)
Newsgroups: comp.arch
Subject: Re: 55 MIPS & 66 MIPS [really: i960, i860, etc]
Message-ID: <5276@omepd.UUCP>
Date: 28 Nov 89 02:39:06 GMT
References: <3044@brazos.Rice.edu> <1358@bnr-rsc.UUCP> <31329@winchester.mips.COM> <22303@gryphon.COM> <3024@brazos.Rice.edu> <31659@winchester.mips.COM>
Sender: news@omepd.UUCP
Reply-To: mcg@mipon2.intel.com (Steven McGeady)
Lines: 50


In article <3044@brazos.Rice.edu>, preston@titan.rice.edu (Preston
Briggs) writes:

> >and did not confuse the two chips, but just to make sure,
> >i860s and i960s are completely different chips.
> 
> Right.  I was just trying to cast aspersions on data that suggest we're
> going to see an average of 2 instructions/cycle sometime this decade
> (wow, big claim).  It might be a couple of years.  
> The problem is lack of compilers, not the chips.

Cast all the aspersions that you wish, but the 960CA can and does execute
useful hand-written assembly-language code at a sustained rate of two
instructions per clock.  How useful is this to the average UNIX user?
Not at all (at least right at the moment).  The 960CA is an embedded
controller.  However, one can code a matrix multiply, Bresenham,
Bezier, or other function to run at this sustained rate.

In order to run on *all* code at a sustained rate of 2 instructions/clock,
a normal (that is, non-VLIW) architecture would have to be capable of
fetching and decoding far more than the 3 instructions per clock that the
960CA attempts.  What is useful is that the 960CA can overlap the
*dispatch* as well as the execution of the instruction stream, so
that the sequence:

	addi	g0,g1,g2
	ldq	(g0),g4		# load 4 words into g4..g7
	addi	g2,g3,g0

is executed in 2 clocks (though the load instruction will not return
until some time later, depending on the latency of memory).  The instruction
stream will continue to be executed until g4, g5, g6, or g7 is used as
a source operand in another instruction.  With good scheduling technology,
this will be several more instructions.
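
To make the dispatch overlap concrete, here is a toy model of the sequence
above.  It is only a sketch under stated assumptions, not a description of
the real 960CA pipeline: the unit labels "REG" and "MEM", the one-clock add
latency, the rule of one instruction per unit per clock, and the default
memory latency of 4 clocks are all made up for illustration.  It shows only
the idea that the load is dispatched alongside the add and that nothing
stalls until one of g4..g7 is needed.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    text: str
    unit: str      # hypothetical unit label: "REG" or "MEM"
    srcs: tuple    # registers read
    dsts: tuple    # registers written


def simulate(program, mem_latency=4):
    """Return the clock (1-based) in which each instruction is dispatched.

    Toy rules (assumptions, not 960CA facts): in-order dispatch, at most one
    instruction per unit per clock, and an instruction waits while any of its
    registers still has a result outstanding.
    """
    ready_at = {}          # register -> first clock in which it may be used
    issue_clock = []
    clock = 1
    i = 0
    while i < len(program):
        used_units = set()
        # Dispatch in program order; stop at the first one that must wait.
        while i < len(program):
            ins = program[i]
            if ins.unit in used_units:
                break
            if any(ready_at.get(r, 0) > clock for r in ins.srcs + ins.dsts):
                break
            latency = mem_latency if ins.unit == "MEM" else 1
            for r in ins.dsts:
                ready_at[r] = clock + latency
            used_units.add(ins.unit)
            issue_clock.append(clock)
            i += 1
        clock += 1
    return issue_clock


# The three-instruction sequence from the text.
PROGRAM = [
    Instr("addi g0,g1,g2", "REG", ("g0", "g1"), ("g2",)),
    Instr("ldq  (g0),g4",  "MEM", ("g0",), ("g4", "g5", "g6", "g7")),
    Instr("addi g2,g3,g0", "REG", ("g2", "g3"), ("g0",)),
]

print(simulate(PROGRAM))   # [1, 1, 2]: three instructions dispatched in 2 clocks
```

Appending an instruction that reads g4 to PROGRAM makes the model wait until
the assumed load latency has elapsed, while an independent instruction placed
ahead of it is dispatched without delay.  That is the scheduling opportunity
described above.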

So the 960CA introduces a new technique that dramatically increases the
potential amount of parallelism available, without resorting to VLIW
schemes that chew up memory space and bandwidth.  This is what's new
and important about the CA.  Attached to the end of this article is a
real program that runs at 66 native MIPS on the 960CA.

> On the other hand, Multiflow has probably been doing it for years.

Only if you assume a rich floating-point mix in their calculations.  I
doubt very much whether the kernel runs at 2 instructions/clock.  That
is our goal (albeit not yet realized).

S. McGeady
Intel Corp.

