Path: utzoo!mnetor!uunet!lll-winken!lll-tis!ames!pasteur!ucbvax!hplabs!hp-pcd!uoregon!omepd!mcg
From: mcg@omepd (Steven McGeady)
Newsgroups: comp.arch
Subject: Re: Press Release: Intel announces 80960 architecture
Message-ID: <3375@omepd>
Date: 14 Apr 88 17:37:57 GMT
References: <3358@omepd> <49265@sun.uucp>
Reply-To: mcg@iwarpo3.UUCP (Steve McGeady)
Organization: Intel Corp., Hillsboro
Lines: 88


In article <49265@sun.uucp> david@sun.uucp (David DiGiacomo) writes:
> ...
>>   The 80960KA and the 80960KB are both available in 20MHz CHMOS* III
>>configurations.  Both embedded processors operate at a sustained 7.5 MIPS
>>and 15K Dhrystones rates.
>
>Why is the integer performance so low?  Do most instructions take 2 cycles?

The short answer is yes, many instructions take two cycles in the current
implementation.  For the long answer, read on.
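
As a rough cross-check on that answer, the figures quoted from the press
release imply an average cycle count per instruction.  The sketch below
only divides the clock rate by the MIPS rating; the variable names are
illustrative, and it assumes "MIPS" counts native 80960 instructions,
which the release does not spell out.

```python
# Figures quoted from the press release above.
clock_hz = 20e6          # 20 MHz part
instr_per_sec = 7.5e6    # 7.5 MIPS sustained

# Average cycles per instruction implied by those two numbers.
cycles_per_instr = clock_hz / instr_per_sec
print(f"{cycles_per_instr:.2f} cycles per instruction on average")
```

That works out to a little under 3 cycles per instruction averaged over
the whole mix, including loads, calls and multiplies.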

Well, first, while 7.5 MIPS might seem slow for a $2000/chip workstation
CPU, the price/performance of the 80960KA and KB is very good compared to their
competition (whoever that may be) in the embedded marketplace.   Claims
of "the fastest microprocessor ever!" are: a) often false; and b) seldom
true for very long.  The 80960KA was the fastest microprocessor around
when we hit silicon in 12/85, but we knew very well that fast silicon
without quality tools and support wasn't very useful.  I won't get dragged
into the "my MIPS number is longer than your MIPS number" game that goes
on here all too often.

Second, as I hope to demonstrate soon, the 7.5 MIPS number is actually
relatively conservative, and depends on the mix that you run.  In other words,
don't feel that you have to apply an automatic derating to that number
because of past (mis-) deeds of unrelated marketeers.

Our number is based on the integer Stanford benchmarks, grep, diff, compress,
and other UNIX programs jury-rigged to run in an embedded environment, and
various customer benchmarks.  I'm trying to gather some up-to-date benchmark
info to post, but it's taking some time to get it together in a form the net's
performance mavens won't shoot holes in.

The *technical* answer to your question is that the register file
*in the current implementations* is not multi-ported (enough ways) and that
"RISC" instructions (typically 1 cycle) suffer an additional cycle of latency
if a value they need is neither a literal nor the destination register
of the previous instruction.  If the register file can be "bypassed",
normal instructions execute in 1 cycle, otherwise they run in 2.
Certain other instructions (bit extract, bit modify, check bit,
compare-and-increment/decrement) take 2 cycles with bypass, 3 without.
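
The bypass rule above can be turned into a small cycle counter.  This is
a minimal sketch, not a description of the actual hardware: the tuple
format, the function name and the example register names are all
hypothetical, and it assumes that *every* source operand must be either
a literal or the previous instruction's destination for the bypass to
apply.

```python
def count_cycles(program):
    """Estimate cycles for a straight-line sequence under the bypass rule.

    program: list of (dest, sources, slow) tuples.  A source is an int
    (a literal) or a register name.  'slow' marks the instructions that
    take 2 cycles with bypass and 3 without (bit extract and friends).
    """
    total = 0
    prev_dest = None
    for dest, sources, slow in program:
        # Assumption: the bypass hits only if all sources qualify.
        bypassed = all(isinstance(s, int) or s == prev_dest for s in sources)
        base = 2 if slow else 1
        total += base if bypassed else base + 1
        prev_dest = dest
    return total


example = [
    ("r4", [1, 2], False),        # literals only: 1 cycle
    ("r5", ["r4", 3], False),     # r4 came from the previous instruction: 1
    ("r6", ["r4", "r5"], False),  # r4 is not the previous destination: 2
    ("r7", ["r6"], True),         # slow instruction, bypass hits: 2
]
print(count_cycles(example))
```

Reordering code so that each instruction consumes the result of the one
just before it is what keeps a sequence at the 1-cycle rate under this
model.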

For completeness:
  Move instructions take 1 cycle per word.

  Integer multiply takes 9-21 cycles (depending on # significant bits),
  typically 18.  Integer divide takes twice as long.  The processor uses an
  early-out Booth multiplier.

  Branch instructions take 0 (yes, zero) to 2 cycles.  In the former case,
  branches can often be overlapped with previous instructions.

  Loads and stores are pipelined (3 deep), and loads take 4 to 5 cycles, stores
  2 to 3 cycles.  Other (unrelated) instructions can be executed in the delay
  slot after the load.  Thus, 3 loads can be executed in 7 cycles (due to
  the pipelining) and up to 3 additional instructions can be executed in
  the delay slots (safely, because of register scoreboarding).

  Call instructions take 9 cycles when a register set in the cache is
  available.  Flushing a set of local registers takes an additional 24 cycles,
  depending on memory speed.  Return takes 7 cycles, with the same caveat.
  The processor only flushes or reloads the register cache when necessary.
  The "call" and "return" instructions, contrary to normal RISC practice,
  do most of what is required to perform a subroutine linkage.  The 80960
  C entry prologue/epilogue is:

	_foo:	# foo takes four integer args, has int [100] auto array
		ldconst	400,r15
		addo	sp,r15,sp	# allocate auto space on stack
		movq	g0,r4		# save parameter registers (move quad)
		...
		mov	???,g0		# return value
		ret

  "ldconst" is a pseudo-op which expands to the most efficient way of loading
  a constant value.  The stack adjustment is only done if there are local
  variables that do not fit in registers.  The saving of the parameter
  registers is only done if the procedure is not a leaf procedure.

  Floating-point instructions take anywhere from 10 cycles (add-real) to 441
  cycles (cosine).  Most floating-point instructions are interruptible and
  resumable.
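
The load timing given above (3 loads in 7 cycles, with other work
overlapped) can be reproduced with a toy scoreboard model.  This is a
sketch under stated assumptions, not the real pipeline: it assumes one
instruction issues per cycle, a fixed 5-cycle load latency (the upper
end of the quoted 4-5 range), 1 cycle for everything else, and no limit
on outstanding loads.  The function name, tuple format and register
names are hypothetical.

```python
LOAD_LATENCY = 5  # assumption: upper end of the quoted 4-5 cycle range


def finish_cycle(program):
    """Return the cycle at which the last result becomes available.

    program: list of (kind, dest, sources); kind is "load" or "alu".
    """
    ready = {}   # scoreboard: register -> cycle its value is available
    next_issue = 0
    finish = 0
    for kind, dest, sources in program:
        # Stall until every source register has been written.
        issue = max([next_issue] + [ready.get(r, 0) for r in sources])
        latency = LOAD_LATENCY if kind == "load" else 1
        done = issue + latency
        ready[dest] = done
        next_issue = issue + 1
        finish = max(finish, done)
    return finish


loads = [("load", "r4", []), ("load", "r5", []), ("load", "r6", [])]
unrelated = [("alu", "r7", ["r8"]), ("alu", "r9", ["r10"]),
             ("alu", "r11", ["r12"])]
dependent = [("alu", "r7", ["r6"])]

print(finish_cycle(loads))              # three back-to-back loads
print(finish_cycle(loads + unrelated))  # unrelated work hides in the delay
print(finish_cycle(loads + dependent))  # a consumer must wait for its load
```

In this model the three unrelated instructions add nothing to the total,
while an instruction that reads a register still being loaded is held
back by the scoreboard until the load completes.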

The next generation of 80960, now under development, will remove the bypass
miss limitation, as well as exploit more opportunities for fine-grained
parallelism in the architecture.  More I cannot say.

S. McGeady
Intel Corp.