Path: utzoo!mnetor!uunet!lll-winken!lll-lcc!ames!hao!boulder!sunybcs!bingvaxu!leah!itsgw!imagine!pawl17.pawl.rpi.edu!jesup
From: jesup@pawl17.pawl.rpi.edu (Randell E. Jesup)
Newsgroups: comp.arch
Subject: Re: Architectural analysis of RPM-40 for general usage [very long]
Message-ID: <547@imagine.PAWL.RPI.EDU>
Date: 18 Mar 88 07:59:34 GMT
References: <1840@winchester.mips.COM> <514@imagine.PAWL.RPI.EDU> <1878@winchester.mips.COM>
Sender: news@imagine.PAWL.RPI.EDU
Reply-To: beowulf!lunge!jesup@steinmetz.UUCP
Organization: RPI Public Access Workstation Lab - Troy, NY
Lines: 163
Keywords: benchmarks architecture RISC

In article <1878@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
:In article <514@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
...
:>>A1.   .336  <= 3 cycles of load latency
:..
:>      Here another effect comes into play: two address ALU instructions.  When
:>the compiler wants 3-address, it has to generate 2 ALU instructions.
...
:Yes, this should help a little.  The specific cases are the R2000 ones where
:an lnop is immediately followed by:
:       a) a 3-register ALU op
:       b) an ALU op that uses an immediate >4 bits.

        I think you mean a load followed by an a) or b) and a nop.

:I have no data to figure out how often those are.  From looking at code,
:it is sad, but true, that many lnops are structurally there almost no
:matter what the architecture&compiler system are (with possible exceptions
:of the VLIW systems).
:I.e., they are the things that result from code like:
:       if ((p != NULL) && (p->thing)) p = p->nextlink;
:which gives code like:
:       lw      reg1,p
:       nop
:       beq     reg1,0,1f
:       nop
:       lw      reg2,thingoffset(reg1)
:       nop
:       beq     reg2,0,1f
:       nop
:       lw      reg1,nextlink(reg1)
:       nop

        I think you left out:
        sw      p,reg1

:1:

        Here's how I'd do it on the rpm40 (and I hope the reorganizer would too)

        ldw     .1, p
        nop
        nop
        nop
        cond    neq, .1, .0
        ldw     .2, thingoffset[.1]
        cond    neq, .1, .0
        ldw     .1, nextlink[.1]
        nop
        cond    neq, .2, .0
        pfx
        stw     p, .1

        Note that there are NO branches in it.  The first load is hard to
fill; you have to hope that you can do something for the next statement in
those nops (I would bet you could use 1 of them, maybe 2; it depends a lot on
the code.  Or you could move stuff down from above, like a store.)  This is a
good example of how the cond instruction can help avoid branches (and thus
branch delays and branch misses.)

:>>A2.+  .279  <= Loads/stores use 4-bit immediates
:
:>      Note: for word load/stores, that 4 bit immediate is shifted left 2,
:
:                                                       The extra 2 bits for
:words definitely help (especially in non-stack, non-GP references),

        Why non-stack?  Stack refs are usually small offsets.  I agree about
globals, even with a global pointer to reduce the number of PFXs.

:>>A8.   .024  <= Jump-and-link
:>      Close, no cigar.  BRA; ; MOV or STW.  This helps fill branch
:Thanx for the clarification: am I still wrong to think that there
:are 3 cycles (of actual work) needed for a large-size branch-and-link?
:(Ignoring branch-delay slots for the moment).

        You're correct, if the call is to a target more than 2K instructions
away from the current address.

:The R2000's single branch-delay slot covers the time to access the I-cache.)

        Does that mean you can get an instruction 1 cycle after you output
its address on the address lines?

:(I thought you might have one, but I didn't have any data.  What I've heard
:(from Stanford MIPS-X) is that the 2nd branch-delay slot is fairly
:hard to fill.
:We agree that stores often go into the branch delay.)

        One thing about the PFXs in the rpm40 is that often you see this:
BRA; PFX; STW.  The Stanford numbers make a number of assumptions that aren't
valid here, though 2 slots are harder to fill than one in general.

:>>A20*  .010  <= Miscellaneous
:What I meant was: can you do left-shift 16 or right-shift 16 in a single
:cycle, with no PFX (depends on how you interpret the immediate field)?
:Not that it matters a lot, but it probably goes up to .005 if you can do
:shifts by 15, but not 16, in one cycle.

        Ok, I understand now.  I think for shifts it acts as an unsigned
number, if I remember correctly (Dennis?)

:>>A7.+  .018  <= Load-upper-immediate @
:>      .001    for a process smaller than 28 bits of instructions, we don't
:>              ever need to load the top 4 bits for address constants.  For
:>              integer constants, if the top 5 are all 1 or 0, we don't need the
:>              top 4 (almost always the case).
:       Actually, almost all of these are for data-address, or logical
:       constants/masks, etc. They're almost never for instructions, so I'd
:       stick with the original number: .018.

        Once again, if the process has less than 28 bits of data memory, you
don't need it.  Few programs require 256 Meg of data.  And of course 99+% of
constants are less than 28 bits long, if you sign-extend.  So I'll stick with
the low number.

:>>A12*. .033  <= Misses due to branch-target-cache misses
:>      .000    mistake - R2000 has misses too.
:       Actually not, or rather: the R2000 has cache misses in its external
:       cache; the RPM has cache misses in its internal cache + misses in
:       the external cache (this was assuming a design where the 64KI+64KD
:       memories were caches, rather than the only memory.  The .033 here was
:       the EXTRA penalty for taking internal cache misses (which might well
:       be external cache hits).
:       I subsumed all of the external cache missing
:       into the cycles->VUP conversion, since I didn't have a better way
:       to get at it, and all of the cycle counts so far were independent
:       of external cache designs.  (Ask again if this doesn't make sense).
:       As noted, the .033 number depended on the 90% rate, and would go up
:       or down depending on the environment, but the number definitely is
:       not zero, unless I misunderstand how the RPM works.
:       .033

        90% is something of a lower bound; the typical number for large
programs should be more like 95%, and 99%+ for small ones.  I take it the
r2000 never misses on a branch if the destination is in its 64K cache?

:Your revisions, plus my revisions to your revisions come out at .965.
:My guess is that, unless there's some fundamental misunderstanding left,
:that there is probably 90% confidence that most programs of the sorts
:modeled, would lie in the range 0.9 - 1.0 (to 1 sig. digit!).

        I'll agree with you here, though I'll emphasize (as Dennis and I have
said before) that the rpm-40 was optimized for a different range of
applications than the r2000, and we're comparing here in the r2000's
specialty.  I suspect that for embedded systems and their programs, they
would be close to equal (on a per-cycle basis), which would mean the 40MHz
RPM-40 would perform as well as a 35+MHz r2000 (note: total speculation
here!)  The RPM-40 can do large systems, but may not be the ultimate choice
for such, though it can do well.

:-john mashey   DISCLAIMER:
:UUCP:  {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
:DDD:   408-991-0253 or 408-720-1700, x253
:USPS:  MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

        Thanks again for such good analysis.  I think this has been pretty
well covered now; if we continue, let's start discussing specific RISC
features and their pluses/minuses (abstract, not rpm-40 vs r2000).
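
        A footnote on the A7 exchange above: the two numeric claims there (28
bits of address reach 256 Meg, and a 32-bit constant whose top 5 bits are all
1 or all 0 doesn't need its top 4 bits loaded) can be checked with a small C
sketch.  The function names are made up for illustration and belong to no
RPM-40 tool; byte addressing is assumed, and the sketch models only the
arithmetic, not the actual PFX/immediate encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the 32-bit constant v can be rebuilt from its
 * low 28 bits by sign extension, i.e. bits 31..27 are all 0 or all 1. */
bool fits_signed28(int32_t v)
{
    uint32_t top5 = (uint32_t)v >> 27;      /* bits 31..27 of v */
    return top5 == 0x00u || top5 == 0x1Fu;
}

/* Hypothetical helper: number of bytes reachable with 28 address bits,
 * assuming byte addressing. */
uint32_t addr_span28(void)
{
    return (uint32_t)1 << 28;
}

/* A few spot checks of the claims as stated in the post. */
void check_examples(void)
{
    assert(addr_span28() == 256u * 1024u * 1024u);  /* 256 Meg */
    assert(fits_signed28(0));
    assert(fits_signed28(-1));
    assert(fits_signed28(-134217728));    /* -2^27, smallest 28-bit value */
    assert(!fits_signed28(-134217729));   /* one below: needs the top bits */
}
```

        Testing the top 5 bits rather than the top 4 is what makes the
sign-extended 28-bit value come out the same as the original: bit 27 becomes
the sign bit, so it has to agree with bits 31..28.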

     // Randell Jesup                         Lunge Software Development
    //  Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//   beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)