Path: utzoo!mnetor!uunet!lll-winken!lll-lcc!pyramid!prls!mips!mash From: mash@mips.COM (John Mashey) Newsgroups: comp.arch Subject: Architectural analysis of RPM-40 for general usage [very long] Message-ID: <1840@winchester.mips.COM> Date: 10 Mar 88 18:48:09 GMT Lines: 505 Keywords: benchmarks architecture RISC Randall Jesup says the RPM40 stuff has been beaten to death, and I agree, but I do have some data that may be useful, which I'd have posted before, but has taken a while to write up, given that I've been busy. As requested a while back by Mr. O'Connor, an analysis of the "40-MIPS" RPM-40 for workstation/general-purpose use (and only that): Summary: 40MIPS peak -> 20MIPS-cycles -> 14-15 VUP -> 12 VUP: (matches 7-9X 68020, or 2-6X Sun-3/260 estimates given by GE folks) 40MIPS (peak native mips, no load delays, pipeline hazards, etc) 20MIPS-cycles (assuming no MMU, cache, or other memory system degradation), i.e., what a 20MHz MIPS R2000 would be without those effects. 14-15VUP (assuming MMU, cache, memory degradation same as MIPS R2000, and compiler technology identical to MIPSco's) [VUP: VAX 11/780 with good optimizers = 1 VUP, integer+fp]. 12 VUP realistic expectation of what one would run like in an actual general-purpose system, with the more vanilla compiler technology that is likely to be available. (correct me if wrong) The rest of this long posting: 1. Issues of general-purpose UNIX systems 2. Doing an analysis of a machine you only have a spec of 3. Detailed analysis 4. Conclusion ------STOP NOW UNLESS YOU ARE A GLUTTON FOR DETAILED ARCHITECTURAL ANALYSIS--- 1. Fundamental issues, in which general-purpose UNIX systems may differ from dedicated, small-memory, embedded systems: a) Programs are much bigger, i.e., they may easily have many megabytes of code, and 100's of megabytes of data, for single processes. (lots already = 16-32MB, at least) Since they have lots of data, they also do many loads and stores. b) The UNIX kernel is real nasty for high-performance machines: it has many sequences that are very hard to pipeline/reorg, it touches little bits of data here and there, and it sometimes has behavior that causes trouble for on-chip assists of otherwise reasoanble size (like branch-target buffers & register windows). The kernel also uses byte and halfword data, (both signed and unsigned), heavily-intermixed, because it uses heavily-packed data structures, which, for various reasons are difficult to change. (Most user programs use halfwords less often). c) For general use, you sometimes have to make worst-case assumptions on the sizes of addresses, for example. You usually must assume that addresses are "large", rather than "small", because you have no idea how big something is until it's linked, and because you may be compiling code for libraries where you can't know how big the final objects will be. Hence, shortcuts often good in dedicated environments (i.e., "short" addresses are OK) don't work. 2. Doing the analysis. How do you get an estimate of performance for a machine you have no actual measurement data for? Here's a possible process: 1) Find one or more machines that: -you have substantial data for (in this case, MIPS R2000) -are related to the machine to be estimated, where related means that it's fairly easy to do direct conversions to the machine to be estimated, and compute the architectural impacts. 2) Feature by feature, do the recalibration, assuming the same software. I.e., start with a base cycle count, and degrade as appropriate, or improve as appropriate. 3) Having done the instruction-cycle analysis, degrade for cache, MMU, and RISC->VAX ratio effects. 3) Finally, make some guesses on the impact on software of the architecture [after all, RISC design is half software.] In general, we'll make best-case assumptions, then come back and do guesses to recalibrate the optimism. 3. Detailed Analysis The following lists features, with assumptions about them (if I'm wrong about the assumptions, I'm happy to be corrected, and if somebody else wants to use the data here to redo an analysis, that's fine. I'm just using what's published, plus the additional comments). This analysis looks at the raw cycle counts of the RPM40 as compared with a MIPS R2000, (as the latter is the only one that I have detailed enough data for; I might have used Stanford MIPS, as it is somewhat closer, but I don't have the data). I used a set of substantial user programs (as, ccom, doducd, espresso, gnuchess, hspice, nroff, optimizer, code generator, timberwolf), written in C, PASCAL, or FORTRAN. I don't have kernel numbers handy, but I'll occasionally note how the kernel differs. Note that we first analyze the instruction cycle counts, without regard to MMU and cache-miss behavior. The "impact" number is the cycle-count degradation expected, i.e., the "expansion" in number of cycles (over 1) due to the feature, as compared with a "mythical" 40MHZ R2000. Items marked * have a little more (but educated) guesswork than the others, which are computed with high confidence from extensive data. Items marked "+" are relevant to the 16-vs-32 issue. A1. .336 <= 3 cycles of load latency R2000 data: 21% of instruction cycles are loads (16-29%) 68% of the (single) load-delay slots are filled (48-88%) (The usual numbers that I remember, but can't find the reference (and it wasn't bcase's posting), are that the expected fill rates for load-delays are: 1: 70% (we got 68% on this bunch, and it contained one nasty program that only filled 48%). 2: 30% 3: 10% A1A: assuming the RPM has no more loads than an R2000 A1B: assuming that similar-quality optimizers and reorganizers are used cycle-expansion = .21 * ( (1-.3) + (1-.1)) = .336 (remember, we already counted the R2000's load-nops: the RPM40 would have equivalent stalls, in the same places. These are the extras. (kernel is worse on this) A2.+ .279 <= Loads/stores use 4-bit immediates R2000 data: 30.1% of instructions are loads+stores (22-38%) 7.1% of the loads/stores could use a 4-bit offset (3.2-18%) R2000 load/stores have 16-bit offsets, which require the RPM40 2 cycles to obtain: cycle-expansion = .301 * (1-.071) = .279 Note: one might argue that there are clever code generation methods to reduce the number of places that addresses need to be calculated, and there are. The MIPS compilers already do most of them, such as computing addresses, or parts of addresses, and saving them in registers, moving them outside of loops, etc. The extra bits for larger offsets show up as LUIs (later). A3.+ .035 <= Add/sub immediates [.050] R2000 data: 8.4% (5.8-12.6%) are add/subtract immediates 3.5-5% of the total instructions are add/sub immediates whose constants don't fit into 4 bits. The 3.5% assumes there are both add/sub immediate (i.e., 4 bits + implied sign). If there is really just add-immed, and you get 3 bits + sign, use the number 5% cycle-expansion = .035 * 1 = .035 A4.+ .013 <= Compare immediates (COND) R2000 data: About 1.7% [.2 - 4.8%] are compare-immediates or equivalent About 1.3% (of the total instructions) need >4 bits. About 1.3% fit in 16-bits, but not 4-bits, and thus require 1 PFX: cycle-expansion = .013 * 1 = .013 A5.+ .013 <= Logical immediates R2000 data: about 1.7% [.4-5.7%] are logical-immediates About 1.3% fit in 16-bits, but not 4-bits, and thus require 1 PFX: cycle-expansion = 0.013 * 1 = 0.013 Note: kernel or other systems-type programs have higher rates of this. A6.+ .011 <= Load-immediate This is used to stick constants in registers for arguments, compares, etc. (it's actually an add to zero, or something like that). I assume the RPM-40 has an equivalent. R2000 data: about 2.1% (.5-4.7%) are load-immediates About 1.1% (of total) fit in 16-bits, but not 4-bits, and thus require 1 PFX: cycle-expansion = 0.011 * 1 = 0.011 A7.+ .018 <= Load-upper-immediate This puts zeroes in low bits, and 16 bits in the top of a register. It's often used for setting up 32-bit address, for example, or a long constant. R2000 data: 1.9% (.5-6.9%) are LUIs About 1.8% (of the total) need the top 4 bits, and thus would need an 2 PFXs, rather than 1 (or 4, rather than 3): cycle-expansion = .018 * 1 = .018 A8. .024 <= Jump-and-link JAL is done in the RPM-40 by MOV (PC) somplace; BRA... R2000 data: 1.2% of the instruction cycles (.3->2.0%) are calls (JAL, JALR), which give 28-bits of byte addressing. On the RPM-40, a JAL is done by a 2-instruction pair, which I assume is a MOV of the PC to the return-address register, followed by a branch, or something equivalent, although I haven't seen the exact sequence. I assume that the return jump is a single cycle. Assuming programs of the size above, the natural code to generate would be: MOV (to save PC) PFX BRA i.e., because 12-bits of displacement are enough for local branches, (99+% on R2000), but not for globals. This adds 2 cycles/call. Assuming that branches get an extra bit of addressabiltiy (halfword alignment), the RPM gets 13 (BRA) + 12 (PFX) bits, for 32MB of code, which is adequate for all of the cited programs, although not so for others. If by some chance the RPM uses a byte offset, getting only 12+12, (16MB), then it would, in normal use in general environment, probably take a hit of another 1.2% for a 2nd PFX, since there are already programs bigger than that floating around. cycle-expansion: .012 * 2 = .024 A9*. .010 <= Load/store hazard The RPM40 is supposed to have a load/store hazard if a store occurs at the exact right number of cycles after a load. This is not data we keep (since we don't have this hazard), and it's hard to compute. Here's a quick guess: R2000 data: about 10% of the instructions are stores, and 20% loads. Assume a 1-cycle hazard and a random distribution of instructions. Then, on the average, a store would find a load in the wrong spot 20% of the time, i.e. cycle-expansion: .1 (fraction of stores) * .2 (fraction of loads) = .02 Let's assume the reorganizer can fix half of these (optimistic, given that we've found loads/stores are the hardest to move around). Thus, we get the cycle penalty as .01. If the hazard is multi-cycle, this goes up, or if the reorganizer has a hard time, it goes up. A10*.+ .098 <= Conditional branch Since the RPM40 has no conditional branch that has an address, the R2000's equivalents would be done with COND/BRA pairs. This requires a little estimation, as it is not something we track directly, i.e., those cases where the COND is able to skip just 1-instruction-equivalent (in the RPM40 sense), without having to change into the COND/BRA pair. R2000 data: 11.4% (6-17%) of instruction cycles are conditional branches. Approximately 30% of branches are backwards branches, which of course cannot be done with just COND. To get a quick estimate of the rest, I disassembled a few programs, and eyeballed the forward conditional branches, and assumed that conditional forward branches that jumped no more than 8 instructions might end up being CONDs that just skipped. A few quick looks convinced me this was less than 10%. However, I'll be conservative and allow 20%, i.e., .8 (of the 70% forward branches) cost a cycle. cycle expansion: .114 * (.3 + .7 * .8) * 1 = .098 A11*.+ .010 <= Partial-word load/store The RPM40 does load/store of partial-word by setting a register (SR2) with a field that gives the type of access, then doing the load/store with "non-word" type. R2000 data: 6% of instruction cycles [0-19% !!-varies heavily] are partial-word load/stores. Systems-type code is usually heavier, numerical code lighter on these. Assuming that it costs just 1 cycle to set SR2 (and it might be 2, if it requires an OR/XOR pair or equivalent), we can guess how often this has to happen: Simple code generator: every time: 6% hit. Smarter code generator: track within basic block Global optimizer: track everywhere in a function. Under the best case, a function must treat SR2 as a register to be saved/restored if modified (they may well already do this), or else, it can be considered "dead" across function calls, and reset appropriately. Systems code, especially with lots of small functions, will feel a hit from this more than numeric code. I'll guess that there is a cycle hit (either from doing unnecessary save/restore, or having to reset this 1/6 times): cycle expansion: .01 * 1 = .01 Note: this would definitely be worse in the kernel. A12*. .033 <= Misses due to branch-target-cache misse According to the talk, there was supposed to be a hit-rate >90%. I have no data on the kinds of programs used to calibrate that. Big programs are clearly worse than little programs in this regard; for example, a call thru the UNIX kernel (like read) misses most of the time, because it often goes 10 levels deep (10 function entries + 10 return-sites = 20, leaving only 12 others for all other branches). R2000 data: over a set of similar benchmarks (including most of those above), 11.4% of the instructions are non-sequential, i.e., taken conditional branches plus all unconditional branches. Assuming a 3-cycle penalty for a miss (10% of the time), this gives: cycle expansion = .114 * (.1 * 3) = .033 A13*. .050 <= Lack of ALU forwarding From all of the discussion, I can't tell how much bypassing the RPM40 does, or doesn't do. From the various hints, it sounds like: a) you can store the output of an ALU op with no delay b) you can't otherwise use the result of an immediately-preceding ALU op. It would be nice if people said what is really going on, but given the sprited defense of non-bypassing, this might be a reasonable guess. This is another one we don't keep detailed statistics on (since we bypass everything). I disassembled several programs, picked randomly- chosen sequences of code (typically about 200 instructions), and eyeballed them to see what % of them would take 1-cycle hits if there were no bypassing, and otherwise following what would be the RPM40 rules, i.e., using PFX instead of R2000's immediates, etc. I eliminated cases where it looked like a reorganizer could get rid of them (not many, this is already heavily reorged code, and since many programs have only 5-10 instructions per basic block, and it is difficult to rearrange this stuff thru basic blocks, it is fairly easy to make a good guess on local information). Of course the numbers are static, not dynamic. On the other hand, they're probably a reasonable guess (and note: this is not the bypassing hit an R2000 would take, it's the hit it would take if its code were transformed into an RPM40's): The 3 samples I took gave 2.5, 5.6, and 7% hits from this, mostly because code doesn't easily migrate amongst basic blocks, other than at the boundaries. NOTE: after I did this, I see Earl Killian got the dynamic numbers for the R2000, which are as expected, much higher, given that we have forwarding and use it everywhere. I'll guess 5%, which I think is reasonably conservative: cycle expansion: .05 A14*. .020 <= Multiply-divide I'm not exactly sure how this is implemented on the RPM, although it probably doesn't have a fast multiplier on the CPU. (Maybe it's on the FPU, in which case some of this would go away). R2000 data: across the benchmarks, about 2.7% of the time was spent in multiply/divide interlock cycles [we have a 12/35 cycle multiple/divider], which includes the effects of having some instructions scheduled into the latency of an asynchronous unit. Assume that the interlock cycles are split 50/50 (faster mults, but more mults than divides, and that RPM's mults take about 3X, i.e., +2 factor, you get: cycle expansion: .013 * 2 = .026, which I'll round down to 2. A15*.+ .040 <= 2-address registers, rather than 3-address ones This is another one that took some guessing, and is especially hard to know, given the willingness of our optimizers to keep things in registers rather than refetching them. From a static eyeballing of dis-assembled code, I saw 3-7% of 3-address ALU ops that looked like they needed to be that way, i.e., where a 2-address op would take a 1-cycle hit. I'll guess conservatively, 4%. cycle expansion: .04 A16*. .039 <= Less registers (21 instead of 32) This definitely takes some guessing, as it has been a while since we bounced the compiler's register allocation around. We do know we were getting good use of the extra registers up through 24-28, which is the number we actually left for normal code. (We hear that similar results were gotten at IBM & HP, but have no reference, as the "hearing" was from bar conversations.) Guess: assume having less registers (and not as symmetric) costs 5% more loads/stores, which may well be (the R2000 has just enough registers that it seldom saves/restores any on a call to a leaf). Assuming 20% loads and 10% stores (close), we get: cycle expansion = .05 * (.2 + .1) = .015 (basic instrs) plus (the load-delays as in A1, except now the .3 unfilled 1st delay slots are extra): .05 * (.3 + .7 + .9) = .01 plus the 93% of ld+st that need >4 bits offset: .015 * .93 = .014 Total: .015 + .01 + .014= .039 Note: this is more of an effect for large, complex programs than small ones. A cross-check is that you get the same number if you assume even 1 more register is saved/restored on the average per function call, (PFX+SW, PFX+LW for 4 cycles), with an average of 100 instructions/call (representative). A17*. .020 <= Architectural Reorganization issues Several of the times above related to reorganization:(A1, A9, A13). A number of factors appear to make the RPM40 more difficult to reorganize for: 1) PFX instructions are difficult, if not impossible to move around away from the instructions prefixed (unlike the R2000's style of using a bypassed GP register). 2) The instruction that sets the SR2 to get partial-word operations is hard to move "too far" away from the instruction(s) that need it. 3) The load/store pipeline hazard must be taken care of. 4) If there is no forwarding, that has to be reorganized also. Reorganization is very important for many RISC processors. The RPM40 has a some extra things to worry about, and one less (R2000 branch delay slot). I'd guess the overall hit to be 2%. (Note: this is the ripple effect of all of this together, not of the individual pieces, which we've already counted.) A18*. ?? <= Coprocessor issues I haven't really touched on this very much, as we don't know much, except that even a few cycles extra latency getting to a floating-point unit can hurt a lot, except in applications that naturally pipeline very well, or if the FPU has long cycle-count operations in the first place. XPLD without XPST is a little puzzling. A19*. -.067 <= Contraction issue (R2000 branch nops) The R2000 loses 6-7% to unfilled branch delay slots, which the RPM40 does not. (of course, the RPM40 takes hits in other areas of branching, but we've already included them). A20* .010 <= Miscellaneous There are a bunch of integer-related issues that I can only guess at, but observing that there are 4 bits in the opcode field for ALU ops, (not the R2000's 5), I'd guess that not all of the R2000's ops are found in the RPM, although I don't know which ones they might be. Also, if the immediate field encodes 16-bit shifts, that will help, and hurt, if not. Bottom line, given everything I know: A1. .336 <= 3 cycles of load latency @ A2.+ .279 <= Loads/stores use 4-bit immediates @ A3.+ .035 <= Add/sub immediates @ A4.+ .013 <= Compare immediates @ A5.+ .013 <= Logical immediates @ A6.+ .011 <= Load-immediate @ A7.+ .018 <= Load-upper-immediate @ A8.+ .024 <= Jump-and-link @ A9*. .010 <= Load/store hazard A10*.+ .098 <= Conditional branch A11*.+ .010 <= Partial-word load/store @ A12*. .033 <= Misses due to branch-target-cache misse A13*. .050 <= Lack of ALU forwarding A14*. .020 <= Multiply-divide A15*.+ .040 <= 2-address registers, rather than 3-address ones A16*.+ .039 <= Less registers (21 instead of 32, sort of A17*. .020 <= Architectural Reorganization issues A18*. ?? <= Coprocessor issues A19*. -.067 <= Contraction issue (R2000 branch nops) A20* .010 <= Miscellaneous Total 0.992 cycle expansion As can be seen, a few items account for big pieces, but the little ones add up to a lot of cycles if there are enough of them (as Everett Dirksen once said, referring to the US budget, "a billion here, a billion there, sooner or later it adds up to real money.") (or real cycles) Grossly, what this means is that an RPM40, with equivalent compiler technology, takes about (1+.99) (i.e. about 2, well within the accuracy of this analysis!) instruction cycles more than an R2000, RUNNING LARGE PROGRAMS. Some of the analysis may be wrong, but the fundamentals are there. Thus, for cycle counts (ignoring cache-miss & MMU overhead), a 40MHz RPM would act more-or-less like a 20MHz R2000, i.e., it would run twice as many (instruction cycles + delay cycles). Now, a 16.7Hz R2000 [the fastest we've actually shipped], with 64K I + 64K D-caches, acts like a 12-VUP (VAX-mips, versus 11/780 with good compilers, integer+float) machine, running realistic programs. Let us note that these were in production 4Q87, and do require a low-latency memory system (as in the Whitechapel Workstation version, for example). Thus, the ratio is 16.7MHz/12 VUP = 1.39 cycles/VUP, to go from raw cycles (including delays, but not MMU or cache-overhead), to actual measured performance. To estimate the performance of either a 20MHz R2000 or 40MHz RPM40, and assuming that the RPM40's floating-point is as good as the R2000+R2010, and that it handles MMU and cache control as well as the R2000 (which does have these built in), and generally ignoring the scale-up effects of relatively slower main memory for both of them: 20 / 1.39 = 14.4 VUP which is well inside "7-9X a 16.7MHz 68020 or 2-6X a Sun-3/260" estimates given by various of the GE folks. So far, all of this has been architectural, i.e., assuming that the RPM40's software was as close to the R2000's as possible, i.e., what would it be if they had our compilers. It is hard to compare against something you don't know, but I would observe: a) Our compiler suite took years of effort by people with tons of experience, including people on their 2nd or 3rd set of RISC compilers. b) This is not an advertisement, but those who've seen our compiler technology have some respect for it. In addition, the RPM40 will be slightly more difficult to do a good global optimizer for: assymetries in the register set(i.e., the ones above 15) the flag bits for word-type turn into something that needs to be tracked and some of the trickeries we used (like the global-register pointer) are less useful on a machine with less registers and shorter offsets. I don't think anyone who's actually seen our compilers go would complain too much if I knocked off 25% for more vanilla compiler technology [this is well within our experiences looking at different levels of optimization.] This one is arguable, but gets at the "if you had this machine today, how would it really act? comparison. 14.4 VUP (for RPM40) * .75 = 12 VUP Given that architectural tradeoffs are what they are, it's always hard to separate a single feature out, but I trust this has given some ideas on the 16vs32-bit instruction issue, and backs up Craig Hansen's thoughts about cycle-count increases on 16-bitters. 4. Conclusion This analysis is only relevant to running substantial programs of the kinds found on general-purpose machines. The RPM may well do relatively better in more embedded-systems environments where the tradeoffs work better, and there is nothing wrong/irrelevant with those environments; they just aren't workstation environments, and whatever one learns in either one doesn't necessarily translate to the other, BECAUSE THE KINDS OF PROGRAMS YOU'RE RUNNING MAY HAVE DIFFERENT STATISTICS. I apologize for any errors in this analysis, which took a fair amount of time to put together, given the sketchiness of the info, but which, I believe, is accurate to within the correctness of the assumptions, and does reflect a modest amount of experience with such things. Feel free to fix it if it's wrong, and if it matters, but as Mr. Jesup says, this has been beaten to death. -- -john mashey DISCLAIMER: UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086