Path: utzoo!mnetor!uunet!lll-winken!lll-lcc!pyramid!prls!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Architectural analysis of RPM-40 for general usage [very long]
Message-ID: <1840@winchester.mips.COM>
Date: 10 Mar 88 18:48:09 GMT
Lines: 505
Keywords: benchmarks architecture RISC

Randall Jesup says the RPM40 stuff has been beaten to death, and I agree,
but I do have some data that may be useful, which I'd have posted before,
but has taken a while to write up, given that I've been busy.

As requested a while back by Mr. O'Connor, an analysis of the "40-MIPS" RPM-40
for workstation/general-purpose use (and only that):

Summary:
	40MIPS peak -> 20MIPS-cycles -> 14-15 VUP -> 12 VUP:
	(matches 7-9X 68020, or 2-6X Sun-3/260 estimates given by GE folks)

40MIPS	(peak native mips, no load delays, pipeline hazards, etc)

20MIPS-cycles	(assuming no MMU, cache, or other memory system degradation),
	i.e., what a 20MHz MIPS R2000 would be without those effects.

14-15VUP (assuming MMU, cache, memory degradation same as MIPS R2000,
	and compiler technology identical to MIPSco's)
	[VUP: VAX 11/780 with good optimizers = 1 VUP, integer+fp].

12 VUP	realistic expectation of what one would run like in an actual
	general-purpose system, with the more vanilla compiler
	technology that is likely to be available. (correct me if wrong)

The rest of this long posting:
1. Issues of general-purpose UNIX systems
2. Doing an analysis of a machine you only have a spec of
3. Detailed analysis
4. Conclusion

------STOP NOW UNLESS YOU ARE A GLUTTON FOR DETAILED ARCHITECTURAL ANALYSIS---

1. Fundamental issues, in which general-purpose UNIX systems may differ
from dedicated, small-memory, embedded systems:
	a) Programs are much bigger, i.e., they may easily have
	many megabytes of code, and 100's of megabytes of data, for
	single processes. (lots already = 16-32MB, at least)
	Since they have lots of data, they also do many loads and stores.

	b) The UNIX kernel is real nasty for high-performance machines:
	it has many sequences that are very hard to pipeline/reorg, it touches
	little bits of data here and there, and it sometimes has behavior
	that causes trouble for on-chip assists of otherwise reasoanble
	size (like branch-target buffers & register windows).
	The kernel also uses byte and halfword data, (both signed and
	unsigned), heavily-intermixed, because it uses heavily-packed
	data structures, which, for various reasons are difficult to
	change.  (Most user programs use halfwords less often).

	c) For general use, you sometimes have to make worst-case
	assumptions on the sizes of addresses, for example.  You usually
	must assume that addresses are "large", rather than "small",
	because you have no idea how big something is until it's linked,
	and because you may be compiling code for libraries where you
	can't know how big the final objects will be.  Hence, shortcuts
	often good in dedicated environments (i.e., "short" addresses are
	OK) don't work.

2. Doing the analysis.
	How do you get an estimate of performance for a machine you
	have no actual measurement data for?
	Here's a possible process:
	1) Find one or more machines that:
		-you have substantial data for (in this case, MIPS R2000)
		-are related to the machine to be estimated, where related means
		that it's fairly easy to do direct conversions to the
		machine to be estimated, and compute the architectural impacts.
	2) Feature by feature, do the recalibration, assuming the same
	software.  I.e., start with a base cycle count, and degrade as
	appropriate, or improve as appropriate.
	3) Having done the instruction-cycle analysis, degrade for cache,
	MMU, and RISC->VAX ratio effects.
	3) Finally, make some guesses on the impact on software of the
	architecture [after all, RISC design is half software.]

	In general, we'll make best-case assumptions, then come back
	and do guesses to recalibrate the optimism.

3. Detailed Analysis
The following lists features, with assumptions about them (if I'm wrong
about the assumptions, I'm happy to be corrected, and if somebody else
wants to use the data here to redo an analysis, that's fine.  I'm just using
what's published, plus the additional comments).  This analysis looks at the
raw cycle counts of the RPM40 as compared with a MIPS R2000,
(as the latter is the only one that I have detailed enough data for;
I might have used Stanford MIPS, as it is somewhat closer,
but I don't have the data).

I used a set of substantial user programs (as, ccom, doducd, espresso,
gnuchess, hspice, nroff, optimizer, code generator, timberwolf), written
in C, PASCAL, or FORTRAN.  I don't have kernel numbers handy, but I'll
occasionally note how the kernel differs.  Note that we first analyze the
instruction cycle counts, without regard to MMU and cache-miss behavior.
The "impact" number is the cycle-count degradation expected,
i.e., the "expansion" in number of cycles (over 1) due to the feature,
as compared with a "mythical" 40MHZ R2000.
Items marked * have a little more (but educated) guesswork than the others,
which are computed with high confidence from extensive data.
Items marked "+" are relevant to the 16-vs-32 issue.

A1.	.336 <= 3 cycles of load latency
	R2000 data: 21% of instruction cycles are loads (16-29%)
		68% of the (single) load-delay slots are filled (48-88%)
		(The usual numbers that I remember, but can't find the
		reference (and it wasn't bcase's posting), are that the
		expected fill rates for load-delays are:
		1: 70% (we got 68% on this bunch, and it contained one nasty
			program that only filled 48%).
		2: 30%
		3: 10%

	A1A: assuming the RPM has no more loads than an R2000
	A1B: assuming that similar-quality optimizers and reorganizers are used

	cycle-expansion = .21 * ( (1-.3) + (1-.1)) = .336
	(remember, we already counted the R2000's load-nops: the RPM40 would
	have equivalent stalls, in the same places.  These are the extras.
	(kernel is worse on this)

A2.+	.279 <= Loads/stores use 4-bit immediates
	R2000 data: 30.1% of instructions are loads+stores (22-38%)
		7.1% of the loads/stores could use a 4-bit offset (3.2-18%)
		R2000 load/stores have 16-bit offsets, which require the RPM40
		2 cycles to obtain:
	cycle-expansion = .301 * (1-.071) = .279

	Note: one might argue that there are clever code generation methods
	to reduce the number of places that addresses need to be calculated,
	and there are.  The MIPS compilers already do most of them,
	such as computing addresses, or parts of addresses, and saving them
	in registers, moving them outside of loops, etc.  The extra bits for
	larger offsets show up as LUIs (later).

A3.+	.035 <= Add/sub immediates [.050]
	R2000 data: 8.4% (5.8-12.6%) are add/subtract immediates
		3.5-5% of the total instructions are add/sub immediates
		whose constants don't fit into 4 bits.
		The 3.5% assumes there are both add/sub immediate (i.e.,
		4 bits + implied sign).  If there is really just add-immed,
		and you get 3 bits + sign, use the number 5%

	cycle-expansion = .035 * 1 = .035

A4.+	.013 <= Compare immediates (COND)
	R2000 data: About 1.7% [.2 - 4.8%] are compare-immediates or equivalent
		About 1.3% (of the total instructions) need >4 bits.
		About 1.3% fit in 16-bits, but not 4-bits, and thus
		require 1 PFX:

	cycle-expansion = .013 * 1 = .013

A5.+	.013 <= Logical immediates
	R2000 data: about 1.7% [.4-5.7%] are logical-immediates
		About 1.3% fit in 16-bits, but not 4-bits, and thus
		require 1 PFX:
	cycle-expansion = 0.013 * 1 = 0.013

	Note: kernel or other systems-type programs have higher rates of this.

A6.+	.011 <= Load-immediate
	This is used to stick constants in registers for arguments,
	compares, etc. (it's actually an add to zero, or something like that).
	I assume the RPM-40 has an equivalent.
	R2000 data: about 2.1% (.5-4.7%) are load-immediates
		About 1.1% (of total) fit in 16-bits, but not 4-bits, and thus
		require 1 PFX:
	cycle-expansion = 0.011 * 1 = 0.011

A7.+	.018 <= Load-upper-immediate
	This puts zeroes in low bits, and 16 bits in the top of a register.
	It's often used for setting up 32-bit address, for example,
	or a long constant.
	R2000 data: 1.9% (.5-6.9%) are LUIs
		About 1.8% (of the total) need the top 4 bits, and thus
		would need an 2 PFXs, rather than 1 (or 4, rather than 3):
	cycle-expansion = .018 * 1 = .018

A8.	.024 <= Jump-and-link
	JAL is done in the RPM-40 by MOV (PC) somplace; BRA...
	R2000 data: 1.2% of the instruction cycles (.3->2.0%) are
		calls (JAL, JALR), which give 28-bits of byte addressing.
	On the RPM-40,  a JAL is done by a 2-instruction pair, which
		I assume is a MOV of  the PC to the return-address register,
		followed by a branch, or something equivalent, although I
		haven't seen the exact sequence.  I assume that the return
		jump is a single cycle.
	Assuming programs of the size above, the natural code to
		generate would be:
		MOV (to save PC)
		PFX
		BRA
	i.e., because 12-bits of displacement are enough for local branches,
	(99+% on R2000), but not for globals. This adds 2 cycles/call.
	Assuming that branches get an extra bit of addressabiltiy (halfword
	alignment), the RPM gets 13 (BRA) + 12 (PFX) bits, for 32MB of code,
	which is adequate for all of the cited programs, although not so
	for others.  If by some chance the RPM uses a byte offset, getting
	only 12+12, (16MB), then it would, in normal use in general
	environment, probably take a hit of another 1.2% for a 2nd PFX,
	since there are already programs bigger than that floating around.

	cycle-expansion: .012 * 2 = .024

A9*.	.010 <= Load/store hazard
	The RPM40 is supposed to have a load/store hazard if a store occurs
	at the exact right number of cycles after a load.  This is not data
	we keep (since we don't have this hazard), and it's hard to compute.
	Here's a quick guess:
	R2000 data: about 10% of the instructions are stores, and 20% loads.
	Assume a 1-cycle hazard and a random distribution of instructions.
	
	Then, on the average, a store would find a load in the wrong spot 20%
	of the time, i.e.
	
	cycle-expansion: .1 (fraction of stores) * .2 (fraction of loads) = .02

	Let's assume the reorganizer can fix half of these (optimistic, given
	that we've found loads/stores are the hardest to move around).
	Thus, we get the cycle penalty as .01.  If the hazard is multi-cycle,
	this goes up, or if the reorganizer has a hard time, it goes up.

A10*.+	.098 <= Conditional branch
	Since the RPM40 has no conditional branch that has an address,
	the R2000's equivalents would be done with COND/BRA pairs.
	This requires a little estimation, as it is not something we track
	directly, i.e., those cases where the COND is able to skip just
	1-instruction-equivalent (in the RPM40 sense), without having to
	change into the COND/BRA pair.
	R2000 data: 11.4% (6-17%) of instruction cycles are conditional
		branches.  Approximately 30% of branches are backwards
		branches, which of course cannot be done with just COND.
		To get a quick estimate of the rest, I disassembled a few
		programs, and eyeballed the forward conditional branches,
		and assumed that conditional forward branches that jumped
		no more than 8 instructions might end up being CONDs that
		just skipped.  A few quick looks convinced me this was
		less than 10%.  However, I'll be conservative and allow
		20%, i.e., .8 (of the 70% forward branches) cost a cycle.

	cycle expansion: .114 * (.3 + .7 * .8) * 1 = .098

A11*.+	.010 <= Partial-word load/store
	The RPM40 does load/store of partial-word by setting a
	register (SR2) with a field that gives the type of access,
	then doing the load/store with "non-word" type.
	R2000 data: 6% of instruction cycles [0-19% !!-varies heavily] are
		partial-word load/stores.  Systems-type code is usually
		heavier, numerical code lighter on these.
	Assuming that it costs just 1 cycle to set SR2 (and it might be
	2, if it requires an OR/XOR pair or equivalent), we can guess how
	often this has to happen:
	Simple code generator: every time: 6% hit.
	Smarter code generator: track within basic block
	Global optimizer: track everywhere in a function.
		Under the best case, a function must treat SR2 as a register
		to be saved/restored if modified (they may well already do
		this), or else, it can be considered "dead" across function
		calls, and reset appropriately.
	Systems code, especially with lots of small functions, will feel
	a hit from this more than numeric code.
	I'll guess that there is a cycle hit (either from doing unnecessary
	save/restore, or having to reset this 1/6 times):
	cycle expansion: .01 * 1 = .01
	Note: this would definitely be worse in the kernel.

A12*.	.033 <= Misses due to branch-target-cache misse
	According to the talk, there was supposed to be a hit-rate >90%.
	I have no data on the kinds of programs used to calibrate that.
	Big programs are clearly worse than little programs in this regard;
	for example, a call thru the UNIX kernel (like read) misses most of
	the time, because it often goes 10 levels deep (10 function entries +
	10 return-sites = 20, leaving only 12 others for all other branches).

	R2000 data: over a set of similar benchmarks (including most of
	those above), 11.4% of the instructions are non-sequential,
	i.e., taken conditional branches plus all unconditional branches.
	Assuming a 3-cycle penalty for a miss (10% of the time), this gives:
	cycle expansion = .114 * (.1 * 3) = .033

A13*.	.050 <= Lack of ALU forwarding
	From all of the discussion, I can't tell how much bypassing the
	RPM40 does, or doesn't do.  From the various hints, it sounds like:
	a) you can store the output of an ALU op with no delay
	b) you can't otherwise use the result of an immediately-preceding
	ALU op.
	It would be nice if people said what is really going on, but given the
	sprited defense of non-bypassing, this might be a reasonable guess.
	This is another one we don't keep detailed statistics on (since we
	bypass everything). I disassembled several programs, picked randomly-
	chosen sequences of code (typically about 200 instructions), and
	eyeballed them to see what % of them would take 1-cycle hits if
	there were no bypassing, and otherwise following what would be the
	RPM40 rules, i.e., using PFX instead of R2000's immediates, etc.
	I eliminated cases where it looked like a reorganizer could get
	rid of them (not many, this is already heavily reorged code,
	and since many programs have only 5-10 instructions per basic block,
	and it is difficult to rearrange this stuff thru basic blocks,
	it is fairly easy to make a good guess on local information).
	Of course the numbers are static, not dynamic.  On the other hand,
	they're probably a reasonable guess (and note: this is not the
	bypassing hit an R2000 would take, it's the hit it would take if
	its code were transformed into an RPM40's):  The 3 samples I took
	gave 2.5, 5.6, and 7% hits from this, mostly because code doesn't
	easily migrate amongst basic blocks, other than at the boundaries.
	NOTE: after I did this, I see Earl Killian got the dynamic numbers
	for the R2000, which are as expected, much higher, given that we
	have forwarding and use it everywhere. 
	I'll guess 5%, which I think is reasonably conservative:
	cycle expansion: .05

A14*.	.020 <= Multiply-divide
	I'm not exactly sure how this is implemented on the RPM,
	although it probably doesn't have a fast multiplier on the CPU.
	(Maybe it's on the FPU, in which case some of this would go away).
	R2000 data: across the benchmarks, about 2.7% of the time was
	spent in multiply/divide interlock cycles [we have a 12/35 cycle
	multiple/divider], which includes the effects of having some
	instructions scheduled into the latency of an asynchronous unit.
	Assume that the interlock cycles are split 50/50 (faster mults,
	but more mults than divides, and that RPM's mults take about 3X,
	i.e., +2 factor, you get:
	cycle expansion: .013 * 2 = .026, which I'll round down to 2.

A15*.+	.040 <= 2-address registers, rather than 3-address ones
	This is another one that took some guessing, and is especially
	hard to know, given the willingness of our optimizers to keep
	things in registers rather than refetching them.
	From a static eyeballing of dis-assembled code, I saw 3-7%
	of 3-address ALU ops that looked like they needed to be that way,
	i.e., where a 2-address op would take a 1-cycle hit.
	I'll guess conservatively, 4%.
	cycle expansion: .04

A16*.	.039 <= Less registers (21 instead of 32)
	This definitely takes some guessing, as it has been a while since we
	bounced the compiler's register allocation around.  We do know we
	were getting good use of the extra registers up through 24-28,
	which is the number we actually left for normal code.
	(We hear that similar results were gotten at IBM & HP, but have no
	reference, as the "hearing" was from bar conversations.)
	Guess: assume having less registers (and not as symmetric) costs
	5% more loads/stores, which may well be (the R2000 has just enough
	registers that it seldom saves/restores any on a call to a leaf).
	Assuming 20% loads and 10% stores (close), we get:
	
	cycle expansion = .05 * (.2 + .1) = .015 (basic instrs)
		plus (the load-delays as in A1, except now the .3 unfilled
		1st delay slots are extra):
		.05 * (.3 + .7 + .9) = .01
		plus the 93% of ld+st that need >4 bits offset:
		.015 * .93 = .014
	Total: .015 + .01 + .014= .039

	Note: this is more of an effect for large, complex programs than
	small ones.  A cross-check is that you get the same number if
	you assume even 1 more register is saved/restored on the average per
	function call, (PFX+SW, PFX+LW for 4 cycles), with an average
	of 100 instructions/call (representative).

A17*.	.020 <= Architectural Reorganization issues
	Several of the times above related to reorganization:(A1, A9, A13).
	A number of factors appear to make the RPM40 more difficult to
	reorganize for:
	1) PFX instructions are difficult, if not impossible to move
	around away from the instructions prefixed (unlike the R2000's
	style of using a bypassed GP register).
	2) The instruction that sets the SR2 to get partial-word operations
	is hard to move "too far" away from the instruction(s) that need it.
	3) The load/store pipeline hazard must be taken care of.
	4) If there is no forwarding, that has to be reorganized also.
	Reorganization is very important for many RISC processors.
	The RPM40 has a some extra things to worry about, and one less
	(R2000 branch delay slot).  I'd guess the overall hit to be 2%.
	(Note: this is the ripple effect of all of this together, not of
	the individual pieces, which we've already counted.)

A18*.	?? <= Coprocessor issues
	I haven't really touched on this very much, as we don't know much,
	except that even a few cycles extra latency getting to a floating-point
	unit can hurt a lot, except in applications that naturally pipeline
	very well, or if the FPU has long cycle-count operations in the first
	place.  XPLD without XPST is a little puzzling. 

A19*.	-.067 <= Contraction issue (R2000 branch nops)
	The R2000 loses 6-7% to unfilled branch delay slots,
	which the RPM40 does not. (of course, the RPM40 takes hits in other
	areas of branching, but we've already included them).

A20*	.010 <= Miscellaneous
	There are a bunch of integer-related issues that I can only guess
	at, but observing that there are 4 bits in the opcode field for
	ALU ops, (not the R2000's 5), I'd guess that not all of the R2000's
	ops are found in the RPM, although I don't know which ones they
	might be.  Also, if the immediate field encodes 16-bit shifts,
	that will help, and hurt, if not.

Bottom line, given everything I know:
A1.	.336 <= 3 cycles of load latency @
A2.+	.279 <= Loads/stores use 4-bit immediates @
A3.+	.035 <= Add/sub immediates @
A4.+	.013 <= Compare immediates @
A5.+	.013 <= Logical immediates @
A6.+	.011 <= Load-immediate @
A7.+	.018 <= Load-upper-immediate @
A8.+	.024 <= Jump-and-link @
A9*.	.010 <= Load/store hazard
A10*.+	.098 <= Conditional branch
A11*.+	.010 <= Partial-word load/store @
A12*.	.033 <= Misses due to branch-target-cache misse
A13*.	.050 <= Lack of ALU forwarding
A14*.	.020 <= Multiply-divide
A15*.+	.040 <= 2-address registers, rather than 3-address ones
A16*.+	.039 <= Less registers (21 instead of 32, sort of 
A17*.	.020 <= Architectural Reorganization issues
A18*.	?? <= Coprocessor issues
A19*.	-.067 <= Contraction issue (R2000 branch nops)
A20*	.010 <= Miscellaneous

Total	0.992	cycle expansion

As can be seen, a few items account for big pieces, but the little ones
add up to a lot of cycles if there are enough of them (as Everett Dirksen
once said, referring to the US budget, "a billion here, a billion there,
sooner or later it adds up to real money.") (or real cycles)

Grossly, what this means is that an RPM40, with equivalent compiler
technology, takes about (1+.99) (i.e. about 2, well within the
accuracy of this analysis!) instruction cycles more than an R2000, RUNNING LARGE
PROGRAMS. Some of the analysis may be wrong, but the fundamentals are there.
	Thus, for cycle counts (ignoring cache-miss & MMU overhead),
	a 40MHz RPM would act more-or-less like a 20MHz R2000, i.e.,
	it would run twice as many (instruction cycles + delay cycles).

	Now, a 16.7Hz R2000 [the fastest we've actually shipped],
	with 64K I + 64K D-caches, acts like a 12-VUP (VAX-mips,
	versus 11/780 with good compilers, integer+float) machine, running
	realistic programs.  Let us note that these were in production
	4Q87, and do require a low-latency memory system (as in the
	Whitechapel Workstation version, for example).
	Thus, the ratio is 16.7MHz/12 VUP = 1.39 cycles/VUP, to go from
	raw cycles (including delays, but not MMU or cache-overhead),
	to actual measured performance.

	To estimate the performance of either a 20MHz R2000 or
	40MHz RPM40, and assuming that the RPM40's floating-point
	is as good as the R2000+R2010, and that it handles MMU and
	cache control as well as the R2000 (which does have these built in),
	and generally ignoring the scale-up effects of relatively slower
	main memory for both of them:

	20 / 1.39 = 14.4 VUP

which is well inside  "7-9X a 16.7MHz 68020 or 2-6X a Sun-3/260" estimates
given by various of the GE folks.

So far, all of this has been architectural, i.e., assuming that the
RPM40's software was as close to the R2000's as possible, i.e.,
what would it be if they had our compilers. It is hard to compare
against something you don't know, but I would observe:
	a) Our compiler suite took years of effort by people with
	tons of experience, including people on their 2nd or 3rd
	set of RISC compilers.
	b) This is not an advertisement, but those who've seen
	our compiler technology have some respect for it. In addition,
	the RPM40 will be slightly more difficult to do a good global
	optimizer for:
		assymetries in the register set(i.e., the ones above 15)
		the flag bits for word-type turn into something that
			needs to be tracked
	and some of the trickeries we used (like the
	global-register pointer) are less useful on a machine with
	less registers and shorter offsets.  I don't think anyone who's
	actually seen our compilers go would complain too much if I
	knocked off 25% for more vanilla compiler technology [this is
	well within our experiences looking at different levels of
	optimization.]  This one is arguable, but gets at the "if you
	had this machine today, how would it really act? comparison.

	14.4 VUP (for RPM40) * .75 = 12 VUP

	Given that architectural tradeoffs are what they are, it's always
	hard to separate a single feature out, but I trust this has
	given some ideas on the 16vs32-bit instruction issue, and backs up
	Craig Hansen's thoughts about cycle-count increases on 16-bitters.

4. Conclusion

This analysis is only relevant to running substantial programs
of the kinds found on general-purpose machines.  The RPM may well
do relatively better in more embedded-systems environments where the
tradeoffs work better, and there is nothing wrong/irrelevant with those
environments; they just aren't workstation environments, and whatever
one learns in either one doesn't necessarily translate to the other,
BECAUSE THE KINDS OF PROGRAMS YOU'RE RUNNING MAY HAVE DIFFERENT STATISTICS.

I apologize for any errors in this analysis, which took a fair
amount of time to put together, given the sketchiness of the info,
but which, I believe, is accurate to within the correctness
of the assumptions, and does reflect a modest amount of experience
with such things.  Feel free to fix it if it's wrong, and if it matters,
but as Mr. Jesup says, this has been beaten to death.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086