Path: utzoo!mnetor!uunet!husc6!mailrus!ames!pasteur!ucbvax!hplabs!pyramid!voder!apple!bcase
From: bcase@Apple.COM (Brian Case)
Newsgroups: comp.arch
Subject: Re: RPM-40 microprocessor @ 40 MHz; dat
Message-ID: <7613@apple.Apple.Com>
Date: 9 Mar 88 19:59:52 GMT
References: <9792@steinmetz.steinmetz.UUCP> <9852@steinmetz.steinmetz.UUCP>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc, Cupertino, CA
Lines: 182

In article <9852@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>An article by bcase@apple.UUCP (Brian Case) says:
>********** In My "Humble" Opinion *********************************
>Things done right on RPM40, tho not neccesarily for the first time :

Thanks for the list. I won't point out the startling similarities
between the RPM40 and the Am29000; most people will know, I think.

>] >The RPM40 runs 40MIPS, all the time, all instructions (even NOPS :-),
>]
>] With the memory system you assume, the Am29000 and I guess the R2000 would
>] run MIPS at their clock rates as well.
>
>Well, you are incorrect. The MIPS chip, correct me if I am wrong,
>needs a four-phase 32-MHz clock to execute 16MIPS (native,peak).
>The Am29000, I beleive, uses 25ns RAM just to make 25MHz,
>I don't know how many phases, and therefor I believe 25MIPS.
>
>Putting 25ns RAM on an R2000, it would still only execute at 16MIPS.
>The processor is not fast enough to take advantage of it. The
>Am29000 needs 25ns RAM just to run at 25MIPS.

Using a four-phase clock has nothing to do with my point. The R2000 can
issue instructions continuously at a 16 MHz rate given the memory system
you assume (when I said clock rate, I didn't mean raw clock rate but
internal instruction issue rate; sorry for the confusion). The Am29000
has a single-phase 25 MHz clock input (or 30 MHz if you buy that
version). You believe incorrectly.
The Am29000 can execute 25 native MIPS with video DRAMs. 25 ns SRAMs
everywhere would let it execute 25 MIPS all the time regardless of other
factors, but VDRAMs with proper scheduling of loads and stores and
sufficient reuse of jump targets will permit peak performance (real
programs don't run at peak, but that's acceptable for some people given
the cost savings, since the performance is still good).

>] The question is how long it takes to get from start of program to
>] finish of program. If the RPM40 is exeucting more loads and stores
>] and more register to register moves to make up for the relatively
>] small number of registers and lack of three-address instructions,
>] etc., then you aren't getting all the bang out of your 40 MHz. On the
>] other hand, if it *is perfect for your application* then great.
>
>"Small number of registers"?? 21 G.P. registers is small ? Says who ?
>Talk to compiler writers : they tell us that 16 is just fine.

Well, I am a compiler writer too. I say 16 (or 21) is too few. That kind
of argument doesn't prove anything. There is plenty of research (and even
a significant amount of practice; e.g. the MetaWare compiler for the
Am29000 does some pretty neat things!) describing how to use lots of
registers (see David Wall's (of DECWRL) research into register
allocation at link time, various stack cache implementations, papers on
procedure integration, interprocedural register allocation, etc.).

>Or maybe your thinking of the Berkelly(sp?)-style register window concept ?
>The R2000 doesn't have that. I think maybe the Am29000 does ??

It's Berkeley (and "you're" not "your", but I misspell things too). Yes,
the Am29000 has a more general register window implementation, but, as
pointed out above, that is not the only way to use lots of registers
quite profitably.

>] [Me argueing that the RPM40 will lose some performance due to some
>] architectural things and that the lack of a TLB makes comparisons
>] slightly unfair.]
>WEll, beyond arguing that a TLB may not slow it down, which contract
>prevents me from discussing, I'll say this : applications that
>don't need a TLB shouldn't pay for a TLB.

I fully agree. However, you shouldn't then turn around and say that the
RPM40 will make a fine UNIX box until you can prove that a TLB will not
cause performance loss. Look, if I can't claim that your 40 MHz in the
lab is not special because I can't disclose what I know, then you can't
sit there and claim that you know something but can't disclose it.
Saying that "contract prevents me" is not substantiation for your claim.
Contract prevents me from saying what I know about other people's 40 MHz
chips, so what?

>] ... the RPM40 must be evaluated with a TLB in order to be
>] compared to most other chips.
>
>Like the MC680[012]0 family ?? 1750A processors ?? AN/YUK-14's ??
>None of these have TLBs.

No, I meant the Am29000 and the R2000, but let's not forget the SPARC
(as in SUN 4s). I really believe that the RPM40 is top dog in its world
(MC680[012]0 family, 1750A processors, AN/YUK-14s). Maybe the R2000 and
the Am29000 wouldn't make it there, or maybe they would. But don't say
the RPM40 doesn't need a TLB because its world is 1750As and AN/YUK-14s
and then complain when John Mashey (for example) says that it won't make
the best UNIX box.

>] Incidentally, I think MIPS would rather have the R2000 known as a 10 MIPS
>] machine at 16 MHz (not the 8 MIPS you quoted).
>
>Actually, I think MIPS Inc. actually claims a 10 Vax-MIPS rating for
>their 16-native-peak-MIPS processor, that uses a 32MHz clock. Which

Right, that's the R2000 in the fastest version currently available.

>places addresses on the address bus once every 30ns. THAT's why
>"MHz" is TOTALLY inappropriate, WORSE than native-peak MIPS, even.
>An RPM40 at 32MHz would also place addresses on the address bus once
>every 30ns, but would execute 32-native-peak-MIPS.

Again, I always assume MHz to be the peak instruction issue rate.
I think most people do too, but my assumption has caused confusion once
again. Sorry. Yes, I agree that the bus strategy used by MIPS is
questionable at very high clock rates (read: instruction issue rates).
We've been through that issue before. But it buys them something too!
Since they are willing to pay for the external cache, they don't have to
put a branch target cache or other instruction cache on chip. They were
betting (I guess) that clock rates wouldn't get astronomical before
density would let them put a decent-sized instruction cache on chip.

It's a tradeoff, that's all it is. Sure, they pay a cost, but they get a
benefit too. You assume SRAMs. You pay a cost, you get a benefit. An
Am29000 system can be built with VDRAMs (so could the RPM40, I bet, but
not at 40 MHz unless someone makes 40 MHz VDRAMs that I don't know of
(the Am29000 will run into this wall soon too)): you pay a cost
(performance loss compared with the max.) but you get a benefit (lower
system cost when you want more memory than SRAMs will let you afford).

Now, as to who has better performance (which is the crux of this
argument, I think): it can't be decided until we all agree on a system
environment. If you want to use your SRAMs, then let us use them too. If
you want to talk about multi-tasking, then we should all have TLBs.

>What's the smallest signal interval on a 25MHz Am29000 ? In the RPM40,
>NO signal ever assumes more than one valid state during a cycle.
>This is not true of the R2000. Is it true of Am29000 ?

I'm not sure I understand exactly what you mean, but I think the
smallest signal interval is one clock cycle (i.e., the channel is
synchronized to the rising clock edge). If there is a signal that
doesn't satisfy your definition, then it would probably be the "bus
invalid" signal, which is determined by the success or failure of
address translation (which isn't known until about half-way through the
cycle, I think).
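
A minimal sketch of the start-to-finish comparison argued for earlier in
this post, assuming the usual model (run time = instructions executed x
cycles per instruction / issue rate). The function names and every
number below are hypothetical illustrations, not measurements of the
RPM40, the Am29000, or the R2000.

```python
def run_time_seconds(instruction_count, cycles_per_instruction, issue_rate_hz):
    """Start-to-finish time: instructions * cycles each / cycles per second."""
    return instruction_count * cycles_per_instruction / issue_rate_hz


def native_mips(cycles_per_instruction, issue_rate_hz):
    """Native instructions per second, in millions."""
    return issue_rate_hz / cycles_per_instruction / 1e6


# Hypothetical machines; none of these figures are measurements.
# "fast_clock" issues at 40 MHz but is assumed to need 30% more
# instructions (extra loads, stores, and register-to-register moves).
fast_clock = run_time_seconds(1.3e6, 1.0, 40e6)
# "more_registers" issues at 25 MHz with the baseline instruction count.
more_registers = run_time_seconds(1.0e6, 1.0, 25e6)

print(f"fast_clock:     {fast_clock * 1e3:.2f} ms "
      f"at {native_mips(1.0, 40e6):.0f} native MIPS")
print(f"more_registers: {more_registers * 1e3:.2f} ms "
      f"at {native_mips(1.0, 25e6):.0f} native MIPS")
```

With these made-up inputs the 40 MHz part still finishes first. The
ranking flips once the assumed instruction-count penalty exceeds
40/25 = 1.6, which is why the inputs (memory system, TLB, compiler) have
to be agreed on before the comparison means anything.
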
>] In your reponse to my response, you go on to say that we should not judge
>] performance by either peak native instructions per second or MHz. I don't
>] know anyone here who would dissagree with you (except marketing people:
>] what else can they say?). In my claim above, I adhered to just that
>] philosophy. This also is what most manufacturers of concern to us here
>] strive for (esp. MIPS Co.).
>
>All three need to be paid attention to. They make big differences.
>For instance, native-MIPS-per-MHz can range from 5 or less
>in a CISC machine, to about 1 for a RISC, to 65K or more for
>a big parrallel machine. And there's only so fast any particular
>technology will let you run the clock, so it DOES matter.

I don't understand "so it DOES matter." I thought you were, at first,
trying to say that we should compare based on VAX-equivalents (or some
other universal "meter bar"). I tried to say that everyone agrees. So,
now, I don't understand what the "it" is in "so it DOES matter." I
thought you were trying to say "just buy the one that runs my program
fastest" (and I would add "in my price range," but that's another
matter). I don't really need to care what the native-MIPS-per-MHz is
("if any word is inappropriate at the end of a sentence, a linking verb
is."). On the other hand, it'll tell you something about the machine,
that's for sure.

I am growing weary. It is not my goal to slander the RPM40. I am just
trying for accuracy. I just want arguments to be well constructed. We
all need to be talking about reasonably similar system environments and
compiler-generated code (or not, but we need to agree). The problem gets
started when deficiencies, or call them "design decisions," are pointed
out and then blindly refuted. For example, the Am29000 ain't no perfect
being. Features/design decisions were reported and discussed here. Much
to my dismay, things that I thought were great maybe aren't so great in
every situation.
I, in my naive way, thought the compare-bytes instruction would make
every C string-handling program blazingly fast. Oops: although I fought
it at first, some nice statistics from John Mashey's simulation, though
not absolutely conclusive, showed that really significant improvements
would be the exception rather than the rule (at least for UNIX
utilities). That is the way to hold a discussion. The recent postings of
stats about forwarding usage are also extremely interesting.
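
For readers who have not seen the idea, here is a rough sketch of how a
compare-bytes style operation could be used to scan a C string one word
at a time. The post does not spell out the Am29000 instruction's exact
semantics, so `any_byte_equal` is an assumed model (true if any of the
four byte lanes of two 32-bit words match), and `strlen_wordwise` is a
hypothetical helper written only for illustration.

```python
def any_byte_equal(a, b):
    """Assumed model: True if any byte lane of 32-bit words a and b matches."""
    for shift in (0, 8, 16, 24):
        if ((a >> shift) & 0xFF) == ((b >> shift) & 0xFF):
            return True
    return False


def strlen_wordwise(buf):
    """Length up to the first NUL byte, checking four bytes per step."""
    # Pad so a terminator always exists and the last word is complete.
    padded = buf + b"\0" * 4
    i = 0
    while True:
        word = int.from_bytes(padded[i:i + 4], "big")
        # One compare against an all-zero word covers four characters.
        if any_byte_equal(word, 0):
            break
        i += 4
    # The NUL is somewhere in this word; find its exact position.
    while padded[i] != 0:
        i += 1
    return i


print(strlen_wordwise(b"hello"))
```

The sketch only shows where the per-word check sits in such a loop. It
says nothing about how much time real programs spend in loops like this,
which is the question the simulation statistics addressed.
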