Path: utzoo!mnetor!uunet!steinmetz!sungoddess!oconnor
From: oconnor@sungoddess.steinmetz (Dennis M. O'Connor)
Newsgroups: comp.arch
Subject: Re: RPM-40 microprocessor @ 40 MHz; dat
Message-ID: <9758@steinmetz.steinmetz.UUCP>
Date: 2 Mar 88 15:44:50 GMT
References: <9727@steinmetz.steinmetz.UUCP>
Sender: news@steinmetz.steinmetz.UUCP
Reply-To: sungoddess!oconnor@steinmetz.UUCP
Organization: GE Corporate R&D Center
Lines: 64

An article by mash@winchester.UUCP (John Mashey) says:
] In article <...> sunset!oconnor@steinmetz.UUCP writes:
] ...
] >] [...] how would you compare PREFIX to an instruction SHIFT and
] >] OR --  SHOR r,lit ::== r := (r<<14)|lit?
] >
] > [...] PREFIX as implemented in RPM40 has no latency
] >problems (major win). SHOR would have latency problems.
] 
] Why would it have latency problems? None of the popular RISCs have
] latency problems with r = r op literal for the usual ops.

Then the RPM40 and its GaAs brethren aren't "popular RISCs".

] I.e., any high-performance system is likely to make use of
] register-bypassing anyway, so that:
] 	r = r op literal
] 	r = r op r
] has zero intervening latency (the performance penalty of a
] cycle's latency for such things is large).

Who said we don't use register bypassing ? But that's not
the point. "Popular RISCs" don't have any latency on
ALU ops because they ARE ( No Dennis don't say it, no, no ... )
SLOW SLOW SLOW ! (ARRGGHH he said it ! BAD DENNIS, BAD <whack>)
An explanation follows :

IMHO, a pipelined processor should run as fast as its ALU
lets it. Some RISC processors DO NOT do this. Instead, they
perform either the operand-read or the result-write for an
instruction in the same pipestage as the ALU op. This results
in a BIG increase in cycle time, and therefore a BIG decrease
in performance.

E.G : say your ALU latency is 25ns, and your register read or write
takes 10ns. Combine a register access with the ALU operation and
you have a 35ns cycle, i.e. a 28MIPS machine. Separate them and you
have a 25ns cycle, i.e. a 40MIPS machine. But you have higher latency.
So which is the win ?
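
For anyone who wants to check the arithmetic, here is a minimal sketch.
It assumes one instruction issued per cycle and no stalls, so peak MIPS
is just 1000 divided by the cycle time in ns. The function and variable
names are made up for illustration.

```python
# Peak MIPS for a pipeline that issues one instruction per cycle.
# Assumption: no stalls, so MIPS = 1000 / cycle_time_in_ns.

def mips(cycle_ns):
    """Return peak millions of instructions per second for a cycle time in ns."""
    return 1000.0 / cycle_ns

ALU_NS = 25.0   # ALU latency from the example above
REG_NS = 10.0   # register read or write time from the example above

# Register access shares a pipestage with the ALU op: 35ns cycle.
combined = mips(ALU_NS + REG_NS)

# Register access gets its own pipestage: cycle set by the ALU alone.
separate = mips(ALU_NS)

print("combined: %.1f MIPS" % combined)   # about 28.6
print("separate: %.1f MIPS" % separate)   # 40.0
```

This only gives the peak rate. It says nothing about how often the extra
pipestage causes an interlock, which is the open question.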

Even a simple bypass path adds to this delay. Whatever the
setup and delay times of this path are, they must be added
to the basic machine cycle time, IF that cycle time is
determined by the ALU, as it SHOULD BE (IMHO).
This is LESS of a penalty than adding a register access,
but still a penalty. So is it a win ?

To be honest, I don't know. Although I have read plenty of
research on BRANCH latency, I haven't seen much research on
how often ALU result latency would result in interlocks, or
even on how often LOAD latency would result in interlocks.
Perhaps John Mashey has. If so, I'd like to see the
references. Until then, I don't know what John means when he
says "any high-performance system" will "likely" have zero latency.
CRAYs don't. They're high performance. Aren't they ?
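
One way to frame the question, as a rough sketch only: compare a machine
whose cycle is stretched by the bypass path against one that keeps the
ALU-limited cycle but stalls on some fraction of instructions. The 5ns
bypass delay and the one-cycle stall are hypothetical numbers picked for
illustration, not RPM40 figures, and the function names are made up.

```python
# Break-even between a stretched cycle (bypass, zero result latency)
# and an ALU-limited cycle with occasional interlock stalls.
# Assumptions: one instruction per cycle, each interlock costs
# `stall_cycles` whole cycles, nothing else stalls the pipe.

def ns_per_instr_with_bypass(alu_ns, bypass_ns):
    """Average time per instruction when the bypass delay lengthens the cycle."""
    return alu_ns + bypass_ns

def ns_per_instr_with_interlocks(alu_ns, stall_fraction, stall_cycles=1):
    """Average time per instruction when a fraction of instructions stall."""
    return alu_ns * (1.0 + stall_fraction * stall_cycles)

def break_even_stall_fraction(alu_ns, bypass_ns, stall_cycles=1):
    """Stall fraction at which both designs take the same average time."""
    return bypass_ns / (alu_ns * stall_cycles)

ALU_NS = 25.0      # ALU latency from the example above
BYPASS_NS = 5.0    # hypothetical bypass setup + delay time

f = break_even_stall_fraction(ALU_NS, BYPASS_NS)
print("break-even interlock fraction: %.2f" % f)
```

Under these assumptions the shorter cycle wins whenever fewer than that
fraction of instructions interlock. The real fraction is exactly the
number I haven't seen measured.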

] -john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Yes, I'm still smiling. Forgive my, uh, "SLOW" outburst : Sorry !


--
    Dennis O'Connor			      UUNET!steinmetz!sunset!oconnor
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)