Path: utzoo!mnetor!uunet!lll-winken!lll-tis!ames!claris!apple!bcase
From: bcase@Apple.COM (Brian Case)
Newsgroups: comp.arch
Subject: Re: RPM-40 microprocessor @ 40 MHz; dat
Message-ID: <7553@apple.Apple.Com>
Date: 4 Mar 88 03:33:14 GMT
References: <9727@steinmetz.steinmetz.UUCP> <9758@steinmetz.steinmetz.UUCP>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Ungermann-Bass Enterprises
Lines: 60

In article <9758@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>"Popular RISCs" don't have any latency on
>ALU ops because they ARE ( No Dennis don't say it, no, no ... )
>SLOW SLOW SLOW ! (ARRGGHH he said it ! BAD DENNIS, BAD <whack>)

Boy, I must say I don't know what you are thinking.  Do you mean they are
slow because they don't have 40 MHz versions?  Or do you mean that they
are slow in terms of VAX-equivalent MIPS?  If the former, then just wait
a little while.  There are probably more 40 MHz RISC machines in most
other companies' labs than there are in yours (I strongly suspect the MIPS
guys have them, for example), but they won't let them out because of
characterization and specification limitations (that is, the parts may
only reach 40 MHz, or even more, at room temperature).  If the latter, I
think you are wrong.  To be less opaque, I think that the RPM40's
VAX-equivalent MIPS rating is no better than that of, say, a 25 MHz
Am29000 or a 16 MHz MIPS (both with caches, you understand; and I am not
saying that the 25 MHz 29000 is the same as a 16 MHz MIPS).  We're talking
integer here.

>IMHO, a pipelined processor should run as fast as its ALU
>lets it. Some RISC processors DO NOT do this. Instead, they
>perform either the operand-read or the result-write for an
>instruction in the same pipestage as the ALU op.

Er, which ones do this?  I don't know of any among MIPS, SPARC, Am29000,
ARM (but it does have a shifter in there, which could be bad), even
CLIPPER.  In fact, I do know of one, but probably no one else out there
does (it's still vaporware).

>Even a simple bypass path adds to this delay. It means
>that whatever the setup and delay times of this path,
>it must be added to the basic machine cycle time, IF
>that cycle time is determined by the ALU, as it SHOULD BE (IMHO).
>This is LESS of a penalty than adding a register access,
>but still a penalty. So is it a win ?

I still agree that the ALU should govern cycle time (but I would always
include bypassing; in my experience, there just isn't enough stuff to move
around to separate the computations from their uses with useful work a
significant fraction of the time), but I now know that a much more
probable determiner of cycle time is cache cycle time.  This can be either
the instruction cache, or the TLB, or whatever.  I suspect that omitting
bypassing is a bad choice, but like you say, there isn't much "proof."
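
To make the point concrete, here is a rough sketch of the "slowest path
wins" argument.  The function name and every delay fed to it are
hypothetical, chosen only for illustration; none of them describe a real
part.

```python
def cycle_time_ns(alu_ns, bypass_ns, cache_ns, tlb_ns):
    """Return (cycle time, limiting path) for hypothetical path delays.

    Assumption: each path must complete within one cycle, so the cycle
    time is simply the slowest of them.
    """
    paths = {
        "alu+bypass": alu_ns + bypass_ns,  # bypass mux adds to the ALU path
        "cache": cache_ns,                 # instruction or data cache access
        "tlb": tlb_ns,                     # address translation
    }
    limiter = max(paths, key=paths.get)
    return paths[limiter], limiter
```

If the cache path is the slowest, as in cycle_time_ns(20.0, 2.0, 25.0,
18.0), the bypass delay is hidden and costs nothing; it only lengthens
the cycle when the ALU path is the limiter.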

>To be honest, I don't know. Although I have read plenty of
>research on BRANCH latency, I haven't seen much research on
>how often ALU result latency would result in interlocks, or
>even on how often LOAD latency would result in interlocks.
>Perhaps John Mashey has. If so, I'd like to see the

The folklore to which I have been exposed goes like this:  probability of
filling the first load delay slot with useful work:  0.7; the second load
delay slot:  0.3; the third:  0.1; thereafter, not significant.
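
Taking those folklore numbers at face value, the cost of a given load
latency can be estimated roughly.  The sketch below assumes (my
assumption, not a measurement) that each unfilled slot costs exactly one
cycle, whether as a no-op or as an interlock stall, and that the per-slot
probabilities can simply be summed.  FILL_PROBABILITY and
expected_wasted_cycles are hypothetical names.

```python
# Fill probabilities from the folklore above; slots past the third are
# treated as never filled ("not significant").
FILL_PROBABILITY = [0.7, 0.3, 0.1]


def expected_wasted_cycles(delay_slots):
    """Expected number of unfilled load delay slots per load."""
    wasted = 0.0
    for slot in range(delay_slots):
        if slot < len(FILL_PROBABILITY):
            p_filled = FILL_PROBABILITY[slot]
        else:
            p_filled = 0.0
        wasted += 1.0 - p_filled  # an unfilled slot costs one cycle
    return wasted
```

Under these assumptions, one delay slot wastes about 0.3 cycles per load,
two about 1.0, and three about 1.9.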

>references. Until then, I don't know what John means when he
>says "any high-performance system" will "likely" have zero latency.
>CRAYs don't. They're high performance. Aren't they ?

For single-thread, integer computations, they're not "high performance"
(or at least not "highest performance") by state-of-the-art RISC
standards (at least our CRAY X-MP isn't).  Perhaps the CRAY 3 will be
quite a bit ahead when it comes out, I dunno.