Path: utzoo!mnetor!uunet!lll-winken!lll-tis!ames!pasteur!ucbvax!hplabs!hp-pcd!uoregon!omepd!mcg
From: mcg@omepd (Steven McGeady)
Newsgroups: comp.arch
Subject: Re: Press Release: Intel announces 80960 architecture
Message-ID: <3375@omepd>
Date: 14 Apr 88 17:37:57 GMT
References: <3358@omepd> <49265@sun.uucp>
Reply-To: mcg@iwarpo3.UUCP (Steve McGeady)
Organization: Intel Corp., Hillsboro
Lines: 88

In article <49265@sun.uucp> david@sun.uucp (David DiGiacomo) writes:
> ...
>>    The 80960KA and the 80960KB are both available in 20MHz CHMOS* III
>>configurations.  Both embedded processors operate at a sustained 7.5 MIPS
>>and 15K Dhrystones rates.
>
>Why is the integer performance so low?  Do most instructions take 2 cycles?

The short answer is yes, many instructions take two cycles in the current
implementation.  For the long answer, read on.

Well, first, while 7.5 MIPS might seem slow for a $2000/chip workstation
CPU, the price/performance of the 80960KA and KB is very good compared to
its competition (whoever that may be) in the embedded marketplace.  Claims
of "the fastest microprocessor ever!" are: a) often false; and b) seldom
true for very long.  The 80960KA was the fastest microprocessor around when
we hit silicon in 12/85, but we knew very well that fast silicon without
quality tools and support wasn't very useful.  I won't get dragged into the
"my MIPS number is longer than your MIPS number" game that goes on here all
too often.

Second, as I hope to demonstrate soon, the 7.5 MIPS number is actually
relatively conservative, and depends on the mix that you run.  In other
words, don't feel that you have to apply an automatic derating to that
number because of past (mis-) deeds of unrelated marketeers.  Our number is
based on the integer Stanford benchmarks, grep, diff, compress, and other
UNIX programs jury-rigged to run in an embedded environment, and various
customer benchmarks.
I'm trying to gather some up-to-date benchmark info to post, but it's
taking some time to get it together in a form the net's performance mavens
won't shoot holes in.

The *technical* answer to your question is that the register file *in the
current implementations* is not multi-ported (enough ways), and that "RISC"
instructions (typically 1 cycle) suffer an additional cycle of latency if a
value they need is neither a literal nor the destination register of the
previous instruction.  If the register file can be "bypassed", normal
instructions execute in 1 cycle; otherwise they run in 2.  Certain other
instructions (bit extract, bit modify, check bit,
compare-and-increment/decrement) take 2 cycles with bypass, 3 without.

For completeness:

	Move instructions take 1 cycle per word.

	Integer multiply takes 9-21 cycles (depending on # significant
	bits), typically 18.  Integer divide takes twice as long.  The
	processor uses an early-out Booth multiplier.

	Branch instructions take 0 (yes, zero) to 2 cycles.  In the former
	case, branches can often be overlapped with previous instructions.

	Loads and stores are pipelined (3 deep), and loads take 4 to 5
	cycles, stores 2 to 3 cycles.  Other (unrelated) instructions can
	be executed in the delay slot after the load.  Thus, 3 loads can be
	executed in 7 cycles (due to the pipelining) and up to 3 additional
	instructions can be executed in the delay slots (safely, because of
	register scoreboarding).

	Call instructions take 9 cycles when a register set in the cache is
	available.  Flushing a set of local registers takes an additional
	24 cycles, depending on memory speed.  Return takes 7 cycles, with
	the same caveat.  The processor only flushes or reloads the
	register cache when necessary.

The "call" and "return" instructions, contrary to normal RISC practice, do
most of what is required to perform a subroutine linkage.
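
To make the bypass rule described above concrete (1 cycle when the register
file can be bypassed, 2 when it cannot), here is a rough sketch in Python.
It is a toy model, not a cycle-accurate 80960 simulator.  The names Instr,
estimate_cycles, mips and cycles_per_instruction are hypothetical, and the
model assumes that *every* register source must be the previous
instruction's destination for the bypass to hit, which is only one reading
of the rule as stated.  Loads, stores, branches, calls and multiplies are
not modelled.

```python
from typing import NamedTuple, Optional, Sequence, Union

# A source operand is a register name (str) or a literal (int).
Operand = Union[str, int]


class Instr(NamedTuple):
    """Hypothetical instruction record for this sketch only."""
    dest: Optional[str]          # destination register, or None
    srcs: Sequence[Operand]      # source operands
    base_cycles: int = 1         # 1 for ordinary ops, 2 for bit extract etc.


def estimate_cycles(program: Sequence[Instr]) -> int:
    """Sum cycles, adding one cycle for each bypass miss.

    Assumption: a bypass hit requires each source to be a literal or the
    destination register of the immediately preceding instruction.
    """
    total = 0
    prev_dest: Optional[str] = None
    for ins in program:
        hit = all(isinstance(s, int) or s == prev_dest for s in ins.srcs)
        total += ins.base_cycles + (0 if hit else 1)
        prev_dest = ins.dest
    return total


def mips(clock_mhz: float, cycles_per_instr: float) -> float:
    """Native MIPS for a given clock and average cycles per instruction."""
    return clock_mhz / cycles_per_instr


def cycles_per_instruction(clock_mhz: float, mips_rate: float) -> float:
    """Average cycles per instruction implied by a clock and a MIPS rate."""
    return clock_mhz / mips_rate


if __name__ == "__main__":
    program = [
        Instr("r4", (1, 2)),        # literals only: bypass hit, 1 cycle
        Instr("r5", ("r4", 3)),     # r4 is the previous dest: hit, 1 cycle
        Instr("r6", ("r4", "r5")),  # r4 is not the previous dest: miss, 2
    ]
    print(estimate_cycles(program))            # 1 + 1 + 2 = 4
    print(cycles_per_instruction(20.0, 7.5))   # average implied by 7.5 MIPS
```

Under these assumptions, the quoted 7.5 MIPS at 20 MHz works out to an
average of roughly 2.7 cycles per instruction across the benchmark mix,
which is consistent with many instructions taking 2 cycles and loads,
calls and multiplies taking more.
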
The 80960 C entry prologue/epilogue is:

	_foo:	# foo takes four integer args, has int [100] auto array
		ldconst	400,r15
		addo	sp,r15,sp	# allocate auto space on stack
		movq	g0,r4		# save parameter registers (move quad)
		...
		mov	???,g0		# return value
		ret

"ldconst" is a pseudo-op which expands to the optimal way of loading a
constant value.  The stack adjustment is only done if there are local
variables that do not fit in registers.  The saving of the parameter
registers is only done if the procedure is not a leaf procedure.

Floating-point instructions take anywhere from 10 cycles (add-real) to 441
cycles (cosine).  Most floating-point instructions are interruptible and
resumable.

The next generation of 80960, now under development, will remove the bypass
miss limitation, as well as exploit more opportunities for fine-grained
parallelism in the architecture.  More I cannot say.

S. McGeady
Intel Corp.