Path: utzoo!mnetor!uunet!lll-winken!lll-tis!ames!claris!apple!bcase
From: bcase@Apple.COM (Brian Case)
Newsgroups: comp.arch
Subject: Re: RPM-40 microprocessor @ 40 MHz; dat
Message-ID: <7553@apple.Apple.Com>
Date: 4 Mar 88 03:33:14 GMT
References: <9727@steinmetz.steinmetz.UUCP> <9758@steinmetz.steinmetz.UUCP>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Ungermann-Bass Enterprises
Lines: 60

In article <9758@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>"Popular RISCs" don't have any latency on
>ALU ops because they ARE ( No Dennis don't say it, no, no ... )
>SLOW SLOW SLOW ! (ARRGGHH he said it ! BAD DENNIS, BAD )

Boy, I must say I don't know what you are thinking. Do you mean they
are slow because they don't have 40 MHz versions? Or do you mean that
they are slow in terms of VAX-equivalent MIPS? If the former, then just
wait a little while. There are probably more 40 MHz RISC machines in
most other companies' labs than there are in yours (I strongly suspect
the MIPS guys have them, for example), but they won't let them out
because of characterization and specification limitations (that is,
they may only be 40 MHz (or even more) at room temperature). If the
latter, I think you are wrong. To be less opaque, I think that the
RPM40 VAX-equivalent MIPS is no better than, say, a 25 MHz Am29000 or a
16 MHz MIPS (both with caches, you understand; and I am not saying that
the 25 MHz 29000 is the same as a 16 MHz MIPS). We're talking integer
here.

>IMHO, a pipelined processor should run as fast as the its ALU
>lets it. Some RISC processors DO NOT do this. Instead, they
>perform either the operand-read or the result-write for an
>instruction in the same pipestage as the ALU op.

Er, which ones do this? I don't know of any among MIPS, SPARC, Am29000,
ARM (but it does have a shifter in there, which could be bad), even
CLIPPER. In fact, I do know of one, but no one else out there probably
does (it's still vaporware).
>Even a simple bypass path adds to this delay. It means
>that whatever the setup and delay times of this path,
>it must be added to the basic machine cycle time, IF
>that cycle time is determined by the ALU, as it SHOULD BE (IMHO).
>This is LESS of a penalty than adding a register access,
>but still a penalty. So is it a win ?

I still agree that the ALU should govern cycle time (but I would always
include bypassing; in my experience, there just isn't enough stuff to
move around to separate the computations from the uses with useful work
a significant fraction of the time), but I now know that a much more
probable cycle time determiner is cache cycle time. This can be either
the instruction cache, or the TLB, or whatever. I suspect that omitting
bypassing is a bad choice, but like you say, there isn't much "proof."

>To be honest, I don't know. Although I have read plenty of
>research on BRANCH latency, I haven't seen much research on
>how often ALU result latency would result in interlocks, or
>even on how often LOAD latency would result in interlocks.
>Perhaps John Mashey has. If so, I'd like to see the

The folklore to which I have been exposed goes like this: First load
delay slot probability of being filled: 0.7; second load delay slot:
0.3; third delay slot: 0.1; thereafter, not significant.

>references. Until then, I don't know what John means when he
>says "any high-performance system" will :likely" have zero latency.
>CRAYs don't. They're high performance. Aren't they ?

For single-thread, integer computations, they're not "high performance"
(or at least not "highest performance") by state-of-the-art RISC
standards (at least our CRAY XMP isn't). Perhaps the CRAY 3 will be
quite a bit ahead when it comes out, I dunno.
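
To put rough numbers on the load-delay-slot folklore quoted above (0.7 / 0.3 / 0.1), here is a minimal sketch in Python. It assumes each figure is the marginal probability that the given slot gets filled with useful work, and it treats slots past the third as never filled, since the post only calls them "not significant." The function names and the `load_fraction` and `base_cpi` parameters are hypothetical, introduced only for illustration; nothing here is measured data.

```python
# Fill probabilities taken from the folklore in the post. Slots beyond the
# third are ASSUMED to have fill probability 0 ("not significant").
FILL_PROBABILITY = [0.7, 0.3, 0.1]


def expected_unfilled_slots(load_delay_slots, fill_probability=FILL_PROBABILITY):
    """Expected number of wasted delay slots per load.

    By linearity of expectation this is the sum of (1 - p_i) over the
    slots, whether or not the slots are filled independently.
    """
    if load_delay_slots < 0:
        raise ValueError("load_delay_slots must be non-negative")
    wasted = 0.0
    for slot in range(load_delay_slots):
        p = fill_probability[slot] if slot < len(fill_probability) else 0.0
        wasted += 1.0 - p
    return wasted


def cycles_per_instruction(load_delay_slots, load_fraction, base_cpi=1.0):
    """Rough CPI estimate, counting only load-delay waste.

    load_fraction (fraction of instructions that are loads) and base_cpi
    are hypothetical inputs, not figures from the post.
    """
    return base_cpi + load_fraction * expected_unfilled_slots(load_delay_slots)


if __name__ == "__main__":
    for slots in range(5):
        print(slots, "delay slot(s):",
              round(expected_unfilled_slots(slots), 2), "wasted per load")
```

Under these assumptions the second delay slot costs more than twice what the first does, because it is filled far less often. That is the kind of figure one would want to set against any cycle-time saving from a longer load pipeline.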