Path: utzoo!mnetor!uunet!steinmetz!sungoddess!oconnor
From: oconnor@sungoddess.steinmetz (Dennis M. O'Connor)
Newsgroups: comp.arch
Subject: Re: RPM-40 microprocessor @ 40 MHz; dat
Message-ID: <9758@steinmetz.steinmetz.UUCP>
Date: 2 Mar 88 15:44:50 GMT
References: <9727@steinmetz.steinmetz.UUCP>
Sender: news@steinmetz.steinmetz.UUCP
Reply-To: sungoddess!oconnor@steinmetz.UUCP
Organization: GE Corporate R&D Center
Lines: 64

An article by mash@winchester.UUCP (John Mashey) says:
] In article <...> sunset!oconnor@steinmetz.UUCP writes:
] ...
] >] [...] how would you compare PREFIX to an instruction SHIFT and
] >] OR -- SHOR r,lit ::== r := (r<<14)|lit?
] >
] > [...] PREFIX as implemented in RPM40 have no latency
] >problems (major win). SHOR would have latency problems.
]
] Why would it have latency problems? None of the popular RISCs have
] latency problems with r = r op literal for the usual ops.

Then the RPM40 and its GaAs brethren aren't "popular RISCs".

] I.e., any high-performance system is likely to make use of
] register-bypassing anyway, so that:
]     r = r op literal
]     r = r op r
] has zero intervening latency (the performance penalty of a
] cycle's latency for such things is large).

Who said we don't use register bypassing? But that's not the point.
"Popular RISCs" don't have any latency on ALU ops because they ARE
( No Dennis don't say it, no, no ... ) SLOW SLOW SLOW !
(ARRGGHH he said it ! BAD DENNIS, BAD )

An explanation follows:

IMHO, a pipelined processor should run as fast as its ALU lets it.
Some RISC processors DO NOT do this. Instead, they perform either
the operand-read or the result-write for an instruction in the
same pipestage as the ALU op. This results in a BIG increase in
cycle time, and therefore a BIG decrease in performance.

E.g.: say your ALU latency is 25ns, and your register read or write
takes 10ns. Combine a register access with the ALU operation and
your cycle is 35ns: you have a 28MIPS machine. Separate them and
your cycle is 25ns: you have a 40MIPS machine.
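
For anyone who wants to plug in their own numbers, here is a minimal sketch of the arithmetic in the example above. The function names are made up for illustration. The stall model is an assumption that is not in the example itself: one instruction issues per cycle when nothing stalls, and each interlock costs exactly one extra cycle.

```python
# Sketch of the cycle-time arithmetic from the 25ns ALU / 10ns register example.
# Assumptions (hypothetical model, not a description of any real chip):
#   - the ALU pipestage sets the machine cycle time
#   - one instruction issues per cycle when nothing stalls
#   - every interlock stall costs exactly one extra cycle

def cycle_ns(alu_ns, reg_access_ns, combined):
    """Cycle time when a register access is (or is not) folded into the ALU stage."""
    return alu_ns + reg_access_ns if combined else alu_ns

def peak_mips(cycle_time_ns):
    """Peak rate in MIPS: 1000 ns per microsecond divided by the cycle time."""
    return 1000.0 / cycle_time_ns

def effective_mips(cycle_time_ns, stalls_per_instr):
    """Rate after charging an average number of one-cycle stalls per instruction."""
    return 1000.0 / (cycle_time_ns * (1.0 + stalls_per_instr))

def break_even_stalls(fast_cycle_ns, slow_cycle_ns):
    """Stalls per instruction at which the fast-cycle design merely ties
    a slow-cycle design that never stalls."""
    return slow_cycle_ns / fast_cycle_ns - 1.0
```

With the numbers above, peak_mips(cycle_ns(25, 10, True)) is about 28.6 and peak_mips(cycle_ns(25, 10, False)) is 40. Under this model, break_even_stalls(25, 35) comes out at 0.4, so the separated design would stay ahead as long as it averaged fewer than 0.4 stall cycles per instruction. How close real code comes to that figure is the part this sketch cannot tell you.
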
But you have higher latency. So which is the win?

Even a simple bypass path adds to this delay. Whatever the setup
and delay times of this path are, they must be added to the basic
machine cycle time, IF that cycle time is determined by the ALU,
as it SHOULD BE (IMHO). This is LESS of a penalty than adding a
register access, but still a penalty.

So is it a win? To be honest, I don't know. Although I have read
plenty of research on BRANCH latency, I haven't seen much research on
how often ALU result latency would result in interlocks, or even on how
often LOAD latency would result in interlocks. Perhaps John Mashey has.
If so, I'd like to see the references. Until then, I don't know what
John means when he says "any high-performance system" will "likely"
have zero latency. CRAYs don't. They're high performance. Aren't they?

] -john mashey

DISCLAIMER: Yes, I'm still smiling. Forgive my, uh, "SLOW" outburst: Sorry!
--
    Dennis O'Connor              UUNET!steinmetz!sunset!oconnor
                   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)