Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!cernvax!chx400!ethz!neptune!inf.ethz.ch!brandis
From: brandis@inf.ethz.ch (Marc Brandis)
Newsgroups: comp.sys.intel
Subject: Re: i960CA benchmark results
Keywords: i960CA, benchmarks
Message-ID: <14326@neptune.inf.ethz.ch>
Date: 8 Nov 90 14:16:38 GMT
References: <13912@neptune.inf.ethz.ch> <2464@lupine.NCD.COM>
Sender: news@neptune.inf.ethz.ch
Reply-To: brandis@inf.ethz.ch (Marc Brandis)
Organization: Departement Informatik, ETH, Zurich
Lines: 107

In article <2464@lupine.NCD.COM> rfg@NCD.COM (Ron Guilmette) writes:
>I believe that Intel states plainly that the 66 MIPS figure is peak
>(but I'm not 100% sure). Perhaps it was 99 MIPS peak for the CA
>because of the possibility that up to three instructions could be
>in three different functional units at one time. But you can't sustain
>that for any more than (perhaps) one cycle, because this (rare?)
>case only happens when three out of a group of four instructions
>meet certain criteria. And then (I believe) you get to spend one
>cycle (or more) executing just the one remaining instruction out
>of that same group of four.

Intel states a sustained performance of 66 MIPS (Intel 80960CA User's
Manual, A-2) and a peak performance of 99 MIPS. The instruction decoder
is able to decode and issue one instruction for each unit each cycle.
The three execution units are the arithmetic and logical operations unit
(REG), the control instructions unit (CTRL) and the memory access unit
(MEM). Executing multiple instructions in one cycle does not restrict
the next cycle to a single instruction. As I said, the CPU can decode
and issue three instructions each cycle, given the right mix of
instructions and - of course - no dependencies between these
instructions.
The instructions to be executed in parallel have to occur in a certain
order (REG-MEM-CTRL), but a compiler can easily reorder them for this,
as there can be no structural dependencies between them anyway. Note
that when the Intel documentation says the scheduler looks at four
instructions at once and is able to fetch four instructions at once, it
does not imply that the instructions to be executed in parallel have to
be in the same quadword. The scheduler has a "rolling quad-word
instruction window" (Intel 80960CA User's Manual, B-4) and after
scheduling instructions from it, it considers the next four unexecuted
instructions (B-7). However, the manual does not say how new
instructions are inserted into the rolling instruction window. So I do
not know how the ordering is handled after some instructions from the
window have been executed and new ones have been added to the window.

I do not agree with the argument that multiple instruction execution is
possible only in very rare cases. Consider the following mix of
instructions.

    Control:              14%
    Arithmetic, logical:  39%
    Data transfer:        26%

The data is taken from Hennessy & Patterson, "Computer Architecture: A
Quantitative Approach", DLX Instruction Set Measurements, Average
Column, Page C-5. Note that this data covers only 79% of all
instructions; the rest are floating-point or rarely executed
instructions.

Let us make the following assumptions:

The DLX instruction set is similar enough to the i960 that this average
distribution will also be found on the i960.

In the absence of floating-point instructions, the above ratios between
control, arithmetic and data transfer instructions also hold for the
100% case.

If both assumptions hold (which I think is reasonable), we have about
50% arithmetic and logical instructions, about 32% data transfer
instructions and 18% control instructions.
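
The scaling step above can be sketched in a few lines of Python. This is
only an illustration under the post's own assumptions, not a model of the
real chip: the function and variable names are hypothetical, the mapping
of each DLX instruction class onto one i960CA unit (REG, MEM, CTRL) is
assumed, and dependencies are ignored entirely. The bound relies on the
statement that the decoder can issue at most one instruction per unit
per cycle.

```python
# Illustrative sketch only; all names here are hypothetical.
# Assumption: each DLX instruction class maps onto exactly one i960CA unit.
DLX_AVERAGE_MIX = {"REG": 39.0, "MEM": 26.0, "CTRL": 14.0}


def scale_to_100(mix):
    """Rescale class percentages so that they sum to 100."""
    total = sum(mix.values())
    return {unit: 100.0 * share / total for unit, share in mix.items()}


def issue_rate_bound(mix):
    """Upper bound on instructions per cycle, ignoring all dependencies.

    Assumes each unit accepts at most one instruction per cycle, so the
    busiest unit limits throughput; the bound cannot exceed the number
    of units.
    """
    scaled = scale_to_100(mix)
    busiest = max(scaled.values())
    return min(float(len(mix)), 100.0 / busiest)


scaled = scale_to_100(DLX_AVERAGE_MIX)
bound = issue_rate_bound(DLX_AVERAGE_MIX)
for unit, share in scaled.items():
    print(f"{unit}: {share:.1f}%")
print(f"issue-rate bound: {bound:.2f} instructions per cycle")
```

With these inputs the arithmetic share comes out just under one half, so
the bound is slightly above two instructions per cycle. Real code will
fall below this once dependencies are taken into account.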
In the absence of data dependencies, you would assume that the
arithmetic and logical unit becomes the bottleneck, while the control
and memory instructions can be easily scheduled in parallel with the
arithmetic ones. This would result in about two instructions per cycle,
i.e. a sustained rate of 66 MIPS.

However, I have to admit that looking only at the average distribution
of instructions does not give the whole picture. E.g., the distribution
in one benchmark for DLX (US Steel) is 23% control instructions, 49%
arithmetic operations and 10% data transfers. Scaled to 100%, you get
about 28% control, 59% arithmetic and 13% data transfer. Again, the
arithmetic unit would be the bottleneck, but as it would have to execute
more than half of the instructions, you cannot achieve two instructions
per cycle.

I am not sure what all this looks like at the statement or loop level.
Of course, these predictions of performance are only valid if the ratio
of different kinds of instructions does not vary too much between
different program parts. And one should not forget that there may also
be structural dependencies between the instructions that the compiler
cannot remove and that cause the instruction scheduler to stall.

It would be interesting to see whether there are other flaws in the
design of the i960CA that reduce the achievable performance. I could
imagine that the small instruction cache, the missing data cache and the
small external bus may be bottlenecks. (I understand that the i960CA is
designed for embedded applications, but the programs used in embedded
applications are not so different from programs in other environments,
considering both size and locality patterns.)

>With respect to other compilers, I have no specific information. I can
>say however that the folks at Intel are no dummies, and that they
>certainly realize that instruction scheduling is a very significant
>issue for i960 compilers.
>I don't think it would be surprising
>(to anyone) if we all found out (later on) that they were looking
>into the question of how to make their chips look better
>(performance-wise) via compiler technology.

I do not think that there is anything wrong if somebody tries to get the
best performance out of a CPU by writing a sophisticated compiler, as
long as he makes optimizations that will be valuable for a large number
of programs. Of course, implementing optimizations that help only the
Dhrystone benchmark or the like is not the way to go, but I do not have
any problems with a sophisticated instruction scheduler in a compiler.

Marc-Michael Brandis
Institut fuer Computersysteme
ETH-Zentrum
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch