Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!cernvax!chx400!ethz!neptune!inf.ethz.ch!brandis
From: brandis@inf.ethz.ch (Marc Brandis)
Newsgroups: comp.sys.intel
Subject: Re: i960CA benchmark results
Keywords: i960CA, benchmarks
Message-ID: <14326@neptune.inf.ethz.ch>
Date: 8 Nov 90 14:16:38 GMT
References: <13912@neptune.inf.ethz.ch> <2464@lupine.NCD.COM>
Sender: news@neptune.inf.ethz.ch
Reply-To: brandis@inf.ethz.ch (Marc Brandis)
Organization: Departement Informatik, ETH, Zurich
Lines: 107

In article <2464@lupine.NCD.COM> rfg@NCD.COM (Ron Guilmette) writes:
>I believe that Intel states plainly that the 66 MIPS figure is peak
>(but I'm not 100% sure).  Perhaps it was 99 MIPS peak for the CA
>because of the possibility that up to three instructions could be
>in three different functional units at one time.  But you can't sustain
>that for any more than (perhaps) one cycle, because this (rare?)
>case only happens when three out of a group of four instructions
>meet certain criteria.  And then (I believe) you get to spend one
>cycle (or more) executing just the one remaining instruction out
>of that same group of four.

Intel states a sustained performance of 66 MIPS (Intel 80960CA User's
Manual, A-2) and a peak performance of 99 MIPS. The instruction decoder
can decode and issue one instruction to each execution unit in each cycle.
The three execution units are the arithmetic and logical operations unit
(REG), the control instructions unit (CTRL) and the memory access unit (MEM).

There is no restriction saying that a cycle in which multiple instructions
have been executed must be followed by a cycle with no more than one
instruction. As I said, the CPU can decode and issue three instructions each
cycle, given the right mix of instructions and, of course, no dependencies
between these instructions. The instructions to be executed in parallel have
to occur in a certain order (REG-MEM-CTRL), but a compiler can easily reorder
them for this, as there can be no structural dependencies between them anyway.

Note that when the Intel documentation says the scheduler looks at four
instructions at once and is able to fetch four instructions at once, it does
not imply that the instructions to be executed in parallel have to be in the
same quadword. The scheduler has a "rolling quad-word instruction window"
(Intel 80960CA User's Manual, B-4) and after scheduling instructions from it,
it considers the next four unexecuted instructions (B-7). However, the manual
does not say how new instructions are inserted into the rolling instruction
window. So I do not know how the ordering is handled after some instructions
from the window have been executed and new ones have been added to it.
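
To make the issue rule concrete, here is a toy model in Python. It is only a
sketch under my own assumptions, not a description of the real scheduler:
instructions are issued strictly in program order, all instructions are
independent, the window always holds the next four unexecuted instructions,
and a group issued in one cycle must follow the REG-MEM-CTRL order. The names
ISSUE_ORDER, WINDOW and count_cycles are made up for this illustration.

```python
ISSUE_ORDER = ("REG", "MEM", "CTRL")  # required order within one issue group
WINDOW = 4                            # assumed: next four unexecuted instructions


def count_cycles(program):
    """Count cycles needed to issue `program`, a list of unit names.

    Assumes independent instructions and strict in-order issue. Since at
    most three instructions issue per cycle, the four-entry window never
    limits this simple model.
    """
    pc = 0
    cycles = 0
    while pc < len(program):
        next_slot = 0  # lowest position in ISSUE_ORDER still free this cycle
        issued = 0
        for unit in program[pc:pc + WINDOW]:
            slot = ISSUE_ORDER.index(unit)
            if slot < next_slot:
                break  # out of REG-MEM-CTRL order: wait for the next cycle
            next_slot = slot + 1
            issued += 1
        pc += issued
        cycles += 1
    return cycles


well_ordered = ["REG", "MEM", "CTRL"] * 4
badly_ordered = ["CTRL", "MEM", "REG"] * 4
print(count_cycles(well_ordered), count_cycles(badly_ordered))
```

In this model the well-ordered sequence issues three instructions in every
cycle, while the same twelve instructions in the opposite order need more
than twice as many cycles. This is why the ordering produced by the compiler
matters.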

I do not agree with the argument that multiple instruction execution is
possible only in very rare cases. Consider the following mix of instructions.

	Control:		14%
	Arithmetic, logical:	39%
	Data transfer:		26%

The data is taken from Hennessy & Patterson, "Computer Architecture: A
Quantitative Approach", DLX Instruction Set Measurements, Average Column,
Page C-5. Note that this data covers only 79% of all instructions; the
rest are floating-point or rarely executed instructions.

Let us make the following two assumptions. First, the DLX instruction set is
similar enough to the i960 that this average distribution will also be found
on the i960. Second, in the absence of floating-point instructions, the above
ratios between control, arithmetic and data transfer instructions still hold
when the mix is scaled to 100%.

If both assumptions hold (which I think is reasonable), we have about
49% arithmetic and logical instructions, about 33% data transfer instructions
and about 18% control instructions.
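
The scaling step is just a division by the covered share (14 + 39 + 26 = 79).
A minimal sketch in Python, where the function name rescale_mix is made up
for this illustration:

```python
def rescale_mix(mix):
    """Rescale percentage shares so that they sum to 100."""
    total = sum(mix.values())
    return {name: 100.0 * share / total for name, share in mix.items()}


# DLX average column as quoted above (covers 79% of all instructions).
dlx_average = {"control": 14, "arithmetic": 39, "data_transfer": 26}

scaled = rescale_mix(dlx_average)
for name, share in scaled.items():
    print(f"{name}: {share:.1f}%")
```

This gives roughly 17.7% control, 49.4% arithmetic and 32.9% data transfer.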

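In the absence of data dependencies, you would assume that the arithmetic
and logical unit becomes the bottleneck, while the control and memory
instructions can easily be scheduled in parallel with the arithmetic
ones. If about every second instruction is an arithmetic one and the
arithmetic unit accepts one instruction per cycle, the CPU executes about
two instructions per cycle. As the peak of 99 MIPS corresponds to three
instructions per cycle, this results in a sustained rate of 66 MIPS.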
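
The bottleneck argument can be written down as a small calculation. This is
a sketch under simplifying assumptions: no dependencies, perfect ordering,
and one instruction per unit per cycle. The clock of 33 MHz is my inference
from the 99 MIPS peak divided by three units, and the function name
sustained_mips is made up for this illustration.

```python
def sustained_mips(mix, clock_mhz=33.0):
    """Upper bound on MIPS when each unit issues at most one instruction
    per cycle and the busiest unit limits the overall rate.

    `mix` maps a unit name to its share of the instruction stream.
    `clock_mhz` is an assumption (99 MIPS peak / 3 units).
    """
    total = sum(mix.values())
    busiest_share = max(mix.values()) / total
    # The busiest unit needs `busiest_share` cycles per instruction on
    # average, but no more than one instruction per unit can issue per cycle.
    instructions_per_cycle = min(len(mix), 1.0 / busiest_share)
    return instructions_per_cycle * clock_mhz


print(sustained_mips({"REG": 50, "MEM": 32, "CTRL": 18}))  # 66.0
```

A perfectly balanced mix would reach the peak of three instructions per
cycle in this model, and a mix consisting only of arithmetic instructions
would drop to one.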

However, I have to admit that looking only at the average distribution of
instructions does not give the whole picture. For example, the distribution
in one benchmark for DLX (US Steel) is 23% control instructions,
49% arithmetic operations and 10% data transfers. Scaled to 100%, you
get about 28% control, 60% arithmetic and 12% data transfer. Again, the
arithmetic unit would be the bottleneck, but as it would have to execute
more than half of the instructions, you cannot achieve two instructions per
cycle.
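
To see how far below two instructions per cycle this mix ends up, one can
compute how busy each unit is when the busiest unit issues an instruction in
every cycle. Again this is only a sketch that ignores dependencies and
ordering constraints; the function name unit_utilization is made up for this
illustration.

```python
def unit_utilization(mix):
    """Fraction of cycles each unit is busy when the busiest unit issues
    one instruction in every cycle (no dependencies, perfect ordering)."""
    busiest = max(mix.values())
    return {unit: count / busiest for unit, count in mix.items()}


# US Steel distribution as quoted above (shares of all instructions).
us_steel = {"CTRL": 23, "REG": 49, "MEM": 10}

util = unit_utilization(us_steel)
ipc = sum(util.values())  # instructions per cycle over all three units
for unit, busy in util.items():
    print(f"{unit}: busy in {100 * busy:.0f}% of the cycles")
print(f"about {ipc:.2f} instructions per cycle")
```

For this mix the result is 82/49, or about 1.67 instructions per cycle, with
the control and memory units idle most of the time.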

I am not sure what all this looks like at the statement or loop level.
Of course, these performance predictions are only valid if the ratio
of different kinds of instructions does not vary too much between different
program parts. And one should not forget that there may also be structural
dependencies between the instructions that the compiler cannot remove
and that cause the instruction scheduler to stall.

It would be interesting to see whether there are other flaws in the
design of the i960CA that reduce the achievable performance. I could
imagine that the small instruction cache, the missing data cache and the
small external bus may be bottlenecks. (I understand that the i960CA is
designed for embedded applications, but the programs used in embedded
applications are not so different from programs in other environments,
considering both size and locality patterns.)

>With respect to other compilers, I have no specific information.  I can
>say however that the folks at Intel are no dummies, and that they
>certainly realize that instruction scheduling is a very significant
>issue for i960 compilers.  I don't think it would be surprizing
>(to anyone) if we all found out (later on) that they were looking
>into the question of how to make their chips look better (performance-
>wise) via compiler technology.

I do not think that there is anything wrong with trying to get the
best performance out of a CPU by writing a sophisticated compiler, as long
as its optimizations are valuable for a large number of programs.
Of course, implementing optimizations that help only the
Dhrystone benchmark or the like is not the way to go, but I do not have any
problems with a sophisticated instruction scheduler in a compiler.


Marc-Michael Brandis
Institut fuer Computersysteme
ETH-Zentrum
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch