Path: utzoo!mnetor!uunet!husc6!mailrus!ames!pasteur!ucbvax!hplabs!pyramid!voder!apple!bcase
From: bcase@Apple.COM (Brian Case)
Newsgroups: comp.arch
Subject: Re: RPM-40 microprocessor @ 40 MHz; dat
Message-ID: <7613@apple.Apple.Com>
Date: 9 Mar 88 19:59:52 GMT
References: <9792@steinmetz.steinmetz.UUCP> <9852@steinmetz.steinmetz.UUCP>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc, Cupertino, CA
Lines: 182

In article <9852@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>An article by bcase@apple.UUCP (Brian Case) says:
>********** In My "Humble" Opinion *********************************
>Things done right on RPM40, tho not neccesarily for the first time :

Thanks for the list.  I won't point out the startling similarities between
the RPM40 and the Am29000; most people will know, I think.

>] >The RPM40 runs 40MIPS, all the time, all instructions (even NOPS :-),
>] 
>] With the memory system you assume, the Am29000 and I guess the R2000 would
>] run MIPS at their clock rates as well.
>
>Well, you are incorrect. The MIPS chip, correct me if I am wrong,
>needs a four-phase 32-MHz clock to execute 16MIPS (native,peak).
>The Am29000, I beleive, uses 25ns RAM just to make 25MHz,
>I don't know how many phases, and therefor I believe 25MIPS.
>
>Putting 25ns RAM on an R2000, it would still only execute at 16MIPS.
>The processor is not fast enough to take advantage of it. The
>Am29000 needs 25ns RAM just to run at 25MIPS. 

Using a four-phase clock has nothing to do with my point.  The R2000 can
issue instructions continuously at a 16 MHz rate given the memory system
you assume (when I said clock rate, I didn't mean raw clock rate but
internal instruction issue rate; sorry for the confusion).  The Am29000
has a single-phase 25 MHz clock input (or 30 MHz if you buy that version).

You believe incorrectly.  The Am29000 can execute 25 native MIPS with
video DRAMs; 25 ns SRAMs everywhere would let it execute 25 MIPS all the
time regardless of other factors, but VDRAMs with proper scheduling of
loads and stores and sufficient reuse of jump targets will permit peak
performance (real programs don't run at peak, but that's acceptable for
some people given the cost savings, since the performance is still good).
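
To put rough numbers on "peak" versus "real programs," here is a
back-of-the-envelope sketch.  The stall figure is invented purely for
illustration (it is not a measured Am29000 number), and the function name
is my own:

```python
def effective_mips(issue_rate_mhz, stall_cycles_per_instr):
    """Native MIPS when each instruction costs one issue cycle plus
    an average number of memory-stall cycles (an assumed figure)."""
    return issue_rate_mhz / (1.0 + stall_cycles_per_instr)

# SRAM everywhere: no stalls, so the chip runs at its issue rate.
print(effective_mips(25, 0.0))

# Hypothetical VDRAM system averaging 0.25 stall cycles per instruction.
print(effective_mips(25, 0.25))
```

Good scheduling of loads and stores and reuse of jump targets amounts to
pushing the second argument toward zero.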

>] The question is how long it takes to get from start of program to
>] finish of program.  If the RPM40 is exeucting more loads and stores
>] and more register to register moves to make up for the relatively
>] small number of registers and lack of three-address instructions,
>] etc., then you aren't getting all the bang out of your 40 MHz.  On the
>] other hand, if it *is perfect for your application* then great.
>
>"Small number of registers"?? 21 G.P. registers is small ? Says who ?
>Talk to compiler writers : they tell us that 16 is just fine.

Well, I am a compiler writer too.  I say 16 (or 21) is too few.  This
argument doesn't prove anything.  There is plenty of research (and even
a significant amount of practice; e.g. the MetaWare compiler for the
Am29000 does some pretty neat things!) describing how to use lots of
registers (see David Wall's (of DECWRL) research into register
allocation at link time, various stack cache implementations, papers
on procedure integration, interprocedural register allocation, etc. etc.).

>Or maybe your thinking of the Berkelly(sp?)-style register window concept ?
>The R2000 doesn't have that. I think maybe the Am29000 does ??

It's Berkeley (and "you're" not "your" but I misspell things too).

Yes, the Am29000 has a more general register window implementation, but,
as pointed out above, that is not the only way to quite profitably use
lots of registers.

>] [Me argueing that the RPM40 will lose some performance due to some
>] architectural things and that the lack of a TLB makes comparisons
>] slightly unfair.]
>WEll, beyond arguing that a TLB may not slow it down, which contract
>prevents me from discussing, I'll say this : applications that
>don't need a TLB shouldn't pay for a TLB. 

I fully agree.  However, you shouldn't then turn around and say that the
RPM 40 will make a fine UNIX box until you can prove that a TLB will not
cause performance loss.  Look, if I can't claim that your 40 MHz in the lab
is not special because I can't disclose what I know, then you can't sit
there and claim that you know something but can't disclose it.  Saying
that "contract prevents me" is not substantiation for your claim.  Contract
prevents me from saying what I know about other people's 40 MHz chips,
so what?

>] ... the RPM40 must be evaluated with a TLB in order to be
>] compared to most other chips.
>
>Like the MC680[012]0 family ??  1750A processors ?? AN/YUK-14's ??
>None of these have TLBs.

No, I meant the Am29000 and the R2000, but let's not forget the SPARC
(as in SUN 4s).  I really believe that the RPM40 is top dog in its
world (MC680[012]0 family, 1750A processors, AN/YUK-14s).  Maybe the
R2000 and the Am29000 wouldn't make it there, or maybe they would.  But
don't say the RPM 40 doesn't need a TLB because its world is 1750As and
AN/YUK-14s and then complain when John Mashey (for example) says that
it won't make the best UNIX box.

>] Incidentally, I think MIPS would rather have the R2000 known as a 10 MIPS
>] machine at 16 MHz (not the 8 MIPS you quoted).
>
>Actually, I think MIPS Inc. actually claims a 10 Vax-MIPS rating for
>their 16-native-peak-MIPS processor, that uses a 32MHz clock. Which

Right, that's the R2000 in the fastest version currently available.

>places addresses on the address bus once every 30ns. THAT's why
>"MHz" is TOTALLY inappropriate, WORSE than native-peak MIPS, even.
>An RPM40 at 32MHz would also place addresses on the address bus once
>every 30ns, but would execute 32-native-peak-MIPS.

Again, I always assume MHz to be the peak instruction issue rate.  I
think most people do too, but my assumption has caused confusion once
again.  Sorry.
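
To make the distinction concrete, here is a small sketch using only the
figures quoted in this thread (32 MHz input clock and 16 peak native MIPS
for the R2000, 40 for the RPM40).  The helper name is my own, and "once
every 30ns" above is taken to be a rounding of the 32 MHz period:

```python
def period_ns(rate_mhz):
    """Time between events, in ns, for a rate given in MHz."""
    return 1000.0 / rate_mhz

# R2000 figures as quoted in the thread.
r2000_clock_mhz = 32   # raw input clock
r2000_issue_mhz = 16   # peak instruction issue rate

print(period_ns(r2000_clock_mhz))   # interval at the raw clock rate
print(period_ns(r2000_issue_mhz))   # interval between issued instructions

# RPM40 as claimed: one instruction per 40 MHz clock.
print(period_ns(40))
```

The first two numbers differ by a factor of two, which is the whole source
of the confusion over what "MHz" means.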

Yes, I agree that the bus strategy used by MIPS is questionable at very
high clock rates (read:  instruction issue rates).  We've been through
that issue before.  But it buys them something too!  Since they are
willing to pay for the external cache, it means that they don't have
to put a branch target cache or other instruction cache on chip.  They
were betting (I guess) that clock rates wouldn't get astronomical before
density would let them put a decent sized instruction cache on chip.  It's
a tradeoff, that's all it is.  Sure, they pay a cost, but they get a
benefit too.  You assume SRAMs.  You pay a cost, you get a benefit.  An
Am29000 system can be built with VDRAMs (so could the RPM 40, I bet, but
not at 40 MHz unless someone makes 40 MHz VDRAMs that I don't know of
(the Am29000 will run into this wall soon too)): you pay a cost
(performance loss compared with the max.) but you get a benefit (lower
system cost when you want more memory than SRAMs will let you afford).

Now, as to who has better performance (which is the crux of this
argument, I think):  it can't be decided until we all agree on a system
environment:  if you want to use your SRAMs, then let us use them too.
If you want to talk about multi-tasking, then we should all have TLBs.
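
The start-to-finish point can be sketched the same way.  The instruction
counts below are made up for illustration; neither machine has been
measured this way here, and the function name is my own:

```python
def run_time_us(instr_count, cycles_per_instr, issue_rate_mhz):
    """Program run time in microseconds:
    instructions * cycles per instruction / (cycles per microsecond)."""
    return instr_count * cycles_per_instr / issue_rate_mhz

# Hypothetical: a 40 MHz machine that needs 30% more instructions
# (extra loads, stores and register moves) for the same program.
print(run_time_us(1_300_000, 1.0, 40))

# Hypothetical: a 25 MHz machine running the shorter instruction stream.
print(run_time_us(1_000_000, 1.0, 25))
```

With these invented numbers the 40 MHz machine still finishes first, but
by well under the 1.6x that the clock ratio alone would suggest.  Change
the memory system or add a TLB and cycles_per_instr moves for both.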

>What's the smallest signal interval on a 25MHz Am29000 ? In the RPM40,
>NO signal ever assumes more than one valid state during a cycle.
>This is not true of the R2000. Is it true of Am29000 ? 

I'm not sure I understand exactly what you mean; but I think the smallest
signal interval is one clock cycle (i.e., the channel is synchronized to
the rising clock edge).  If there is a signal that doesn't satisfy your
definition, then it would probably be the "bus invalid" signal which is
determined by the success or failure of address translation (which isn't
known until about half-way through the cycle, I think).

>] In your reponse to my response, you go on to say that we should not judge
>] performance by either peak native instructions per second or MHz.  I don't
>] know anyone here who would dissagree with you (except marketing people:
>] what else can they say?).  In my claim above, I adhered to just that
>] philosophy.  This also is what most manufacturers of concern to us here
>] strive for (esp. MIPS Co.).
>
>All three need to be paid attention to. They make big differences.
>For instance, native-MIPS-per-MHz can range from 5 or less
>in a CISC machine, to about 1 for a RISC, to 65K or more for
>a big parrallel machine. And there's only so fast any particular
>technology will let you run the clock, so it DOES matter.

I don't understand "so it DOES matter."  I thought you were, at first,
trying to say that we should compare based on VAX-equivalents (or
some other universal "meter bar").  I tried to say that everyone agrees.
So, now, I don't understand what is the "it" in "so it DOES matter."
I thought you were trying to say "just buy the one that runs my program
fastest" (and I would add "in my price range" but that's another matter).
I don't really need to care what the native-MIPS-per-MHz is ("if any word
is inappropriate at the end of a sentence, a linking verb is.").  On
the other hand, it'll tell you something about the machine, that's for
sure.

I am growing weary.  It is not my goal to slander the RPM40.  I am just
trying for accuracy.  I just want arguments to be well constructed.
We need to all be talking about reasonably similar system environments
and compiler generated code (or not, but we need to agree).  The problem
gets started when deficiencies, or call them "design decisions," are
pointed out and then blindly refuted.

For example, the Am29000 ain't no perfect being.  Features/design
decisions were reported and discussed here.  Much to my dismay, things
that I thought were great maybe aren't so great in every situation.
I, in my naive way, thought the compare-bytes instruction would make
every C string-handling program blazingly fast.  Oops, although I
fought it at first, some nice statistics, though not absolutely
conclusive, from John Mashey's simulation showed that really significant
improvements would be the exception rather than the rule (at least for
UNIX utilities).  That is the kind way to hold a discussion.  The
recent postings of stats about forwarding usage are also extremely
interesting.