Path: utzoo!mnetor!uunet!husc6!mit-eddie!uw-beaver!cornell!batcomputer!itsgw!imagine!pawl8.pawl.rpi.edu!jesup
From: jesup@pawl8.pawl.rpi.edu (Randell E. Jesup)
Newsgroups: comp.arch
Subject: Re: Architectural analysis of RPM-40 for general usage [very long]
Message-ID: <514@imagine.PAWL.RPI.EDU>
Date: 11 Mar 88 09:36:48 GMT
References: <1840@winchester.mips.COM>
Sender: news@imagine.PAWL.RPI.EDU
Reply-To: beowulf!lunge!jesup@steinmetz.UUCP
Organization: RPI Public Access Workstation Lab - Troy, NY
Lines: 396
Keywords: benchmarks architecture RISC

In article <1840@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Randall Jesup says the RPM40 stuff has been beaten to death, and I agree,
>but I do have some data that may be useful, which I'd have posted before,
>but has taken a while to write up, given that I've been busy.

	And I thank you for it.  Numbers/real comparison stuff is great;
"my chip is better than yours" stuff gets old fast.  I'll keep my comments
to a minimum, given the length of the original posting.

>Summary:
>	40MIPS peak -> 20MIPS-cycles -> 14-15 VUP -> 12 VUP:
>	(matches 7-9X 68020, or 2-6X Sun-3/260 estimates given by GE folks)

>------STOP NOW UNLESS YOU ARE A GLUTTON FOR DETAILED ARCHITECTURAL ANALYSIS---

>	a) Programs are much bigger, i.e., they may easily have
>	many megabytes of code, and 100's of megabytes of data, for
>	single processes. (lots already = 16-32MB, at least)

	The average program run isn't anywhere near that big, though.

>	c) For general use, you sometimes have to make worst-case
>	assumptions on the sizes of addresses, for example.  You usually
>	must assume that addresses are "large", rather than "small",
>	because you have no idea how big something is until it's linked,
>	and because you may be compiling code for libraries where you
>	can't know how big the final objects will be.  Hence, shortcuts
>	often good in dedicated environments (i.e., "short" addresses are
>	OK) don't work.

	Here I have to disagree with you.  How about this:  The object code
is compressed assembler, the linker links (with compressed assembler libraries)
and does reorganization/assembly.  This allows you to always know the process
instruction memory size (though the data side is a bit tougher).

>Items marked * have a little more (but educated) guesswork than the others,
>which are computed with high confidence from extensive data.
>Items marked "+" are relevant to the 16-vs-32 issue.

>A1.	.336 <= 3 cycles of load latency
>	R2000 data: 21% of instruction cycles are loads (16-29%)
>		expected fill rates for load-delays are:
>		1: 70% (we got 68% on this bunch, and it contained one nasty
>			program that only filled 48%).
>		2: 30%
>		3: 10%

	Here another effect comes into play: two-address ALU instructions.  When
the compiler wants 3-address, it has to generate 2 ALU instructions.  This
means a slightly higher ratio of ALU ops to load/store ops, and therefore
better filling of load delays.  That should increase the 30% and 10% a fair
amount, but I don't know how much.  Also, ALU ops with prefixes will help
here too, filling otherwise useless cycles.
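
	A minimal sketch of that lowering, in Python.  The mnemonics, the
"OP dest, src" operand order and the register names are made up for
illustration; only the copy-then-operate-in-place pattern comes from the
paragraph above.

```python
# Hypothetical mnemonics and register names; only the MOV-then-operate
# pattern is taken from the discussion.
COMMUTATIVE = ("ADD", "AND", "OR", "XOR")

def lower_to_two_address(op, dest, src1, src2):
    """Lower 'dest = src1 op src2' to two-address form (dest = dest op src)."""
    if dest == src1:
        # Already in two-address shape: a single ALU instruction.
        return [f"{op} {dest}, {src2}"]
    if dest == src2:
        # A MOV into dest would clobber src2.  Commutative ops can swap
        # operands; anything else is out of scope for this sketch.
        if op in COMMUTATIVE:
            return [f"{op} {dest}, {src1}"]
        raise ValueError("needs a scratch register; not handled here")
    # General case: copy first, then operate in place (2 ALU instructions).
    return [f"MOV {dest}, {src1}", f"{op} {dest}, {src2}"]
```

	The general case is the one that adds an ALU instruction to the mix,
which is the extra material available for filling load delays.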

>	A1B: assuming that similar-quality optimizers and reorganizers are used

	See note above re: linker; otherwise OK.

>A2.+	.279 <= Loads/stores use 4-bit immediates
>		7.1% of the loads/stores could use a 4-bit offset (3.2-18%)
>		R2000 load/stores have 16-bit offsets, which require the RPM40
>		2 cycles to obtain:

	Note:  for word load/stores, that 4 bit immediate is shifted left 2,
effectively 6 bits in byte addressing.  Halfwords (minor) get shifted left 1.
Also, what about >16 bit offsets?  Obviously, the R2000 will do a mov; add
immediate 32-bit; load sequence (well, a guess), so it's hard to compare here.
Luckily, >16 bit offsets are rare (note the shifting comes into play again
here).
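
	To make the scaling concrete, a small Python sketch.  The shift
amounts are the ones given above; treating the 4-bit field as unsigned is
an assumption made only for this example.

```python
# Words shift left 2, halfwords left 1 (from the text).  The unsigned
# interpretation of the 4-bit field is an assumption of this sketch.
SHIFT = {"byte": 0, "half": 1, "word": 2}

def effective_offset(imm4, size):
    """Byte offset encoded by a 4-bit immediate for the given access size."""
    if not 0 <= imm4 <= 0xF:
        raise ValueError("immediate must fit in 4 bits")
    return imm4 << SHIFT[size]

def fits_without_prefix(byte_offset, size):
    """True if an aligned byte offset is reachable with no PFX."""
    shift = SHIFT[size]
    aligned = byte_offset % (1 << shift) == 0
    return aligned and 0 <= (byte_offset >> shift) <= 0xF
```

	Under that assumption a word access reaches byte offsets 0 through 60
in steps of 4, which is why the fraction of loads/stores needing no PFX
should be higher than a plain 4-bit byte offset would suggest.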

>A3.+	.035 <= Add/sub immediates [.050]
>		The 3.5% assumes there are both add/sub immediate (i.e.,
>		4 bits + implied sign).  If there is really just add-immed,
>		and you get 3 bits + sign, use the number 5%

	I'm not sure I understand you.  For reference, the leftmost bit of the
immediate is sign-extended.  All ALU ops can use immediates.

>A4.+	.013 <= Compare immediates (COND)
>		About 1.3% fit in 16-bits, but not 4-bits, and thus
>		require 1 PFX:

	Once again, what about >16 bits?  (Not TOO common, but not totally
uncommon either.)

>A6.+	.011 <= Load-immediate
>	This is used to stick constants in registers for arguments,
>	compares, etc. (it's actually an add to zero, or something like that).
>	I assume the RPM-40 has an equivalent.

	It's MOV with the source immediate. (Remember, no 3-addr ALU ops)

>		About 1.1% (of total) fit in 16-bits, but not 4-bits, and thus
>		require 1 PFX:

	> 16 Bits?

>A7.+	.018 <= Load-upper-immediate
>	This puts zeroes in low bits, and 16 bits in the top of a register.
>	It's often used for setting up 32-bit address, for example,
>	or a long constant.
>	R2000 data: 1.9% (.5-6.9%) are LUIs
>		About 1.8% (of the total) need the top 4 bits, and thus
>		would need an 2 PFXs, rather than 1 (or 4, rather than 3):

	One NEVER needs more than 3.  3 PFX's + 4 bits in the instruction =
40 bits.  2 PFX's + 4 bits = 28.  1 PFX + 4 bits = 16.  (This is for ALU ops
and loads/stores.  For branches/XP stuff, the instruction has more immediate
bits available (12).)
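
	The arithmetic above implies 12 bits per PFX (16 - 4, 28 - 16,
40 - 28).  A sketch in Python of how many prefixes a given constant would
need, assuming the immediate is sign-extended from its leftmost bit and
that values are limited to 32 bits; the function names are mine.

```python
PFX_BITS = 12   # implied by 16 - 4, 28 - 16, 40 - 28
INSN_BITS = 4   # immediate bits in an ALU or load/store instruction

def signed_bits(value):
    """Minimum two's-complement width that holds value."""
    return (value if value >= 0 else ~value).bit_length() + 1

def prefixes_needed(value):
    """PFX count for a sign-extended immediate (32-bit range assumed)."""
    if not -(1 << 31) <= value < (1 << 31):
        raise ValueError("sketch covers 32-bit signed values only")
    extra = max(0, signed_bits(value) - INSN_BITS)
    return -(-extra // PFX_BITS)   # ceiling division
```

	With these assumptions -8..7 need no PFX, anything that fits in 16
signed bits needs one, 28 signed bits needs two, and no 32-bit value needs
more than three.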

>A8.	.024 <= Jump-and-link
>	JAL is done in the RPM-40 by MOV (PC) somplace; BRA...

	Close, no cigar.  BRA; <fill>; MOV or STW.  This helps fill branch
delay, which otherwise might be tough (subroutine calls are worse than
other types of branches, usually).

>	R2000 data: 1.2% of the instruction cycles (.3->2.0%) are
>		calls (JAL, JALR), which give 28-bits of byte addressing.
>	On the RPM-40,  a JAL is done by a 2-instruction pair, which
>		I assume is a MOV of  the PC to the return-address register,
>		followed by a branch, or something equivalent, although I
>		haven't seen the exact sequence.  I assume that the return
>		jump is a single cycle.
>	Assuming programs of the size above, the natural code to
>		generate would be:
>		MOV (to save PC)
>		PFX
>		BRA
>	i.e., because 12-bits of displacement are enough for local branches,
>	(99+% on R2000), but not for globals. This adds 2 cycles/call.

	For internal calls, 12 bits is often enough.  Do you have data on call
distances?  I would suspect that even in moderate-sized programs at least
10% of calls are in range, maybe as high as 25%; and in small utilities 25%+.

>	Assuming that branches get an extra bit of addressabiltiy (halfword
>	alignment), the RPM gets 13 (BRA) + 12 (PFX) bits, for 32MB of code,
>	which is adequate for all of the cited programs, although not so
>	for others.  If by some chance the RPM uses a byte offset, getting

	Yup, 32 Meg, you got it.  Why generate bits that are always 0?
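
	The 32 Meg figure as a quick check, in Python.  The constants are the
bit counts quoted above; the function name is mine, and the span is the
total range covered, whichever way it is split around the PC.

```python
BRA_BITS = 12      # displacement bits in the branch instruction
PFX_BITS = 12      # extra bits supplied by one PFX
ALIGN_SHIFT = 1    # halfword-aligned targets: the low bit is implied zero

def branch_span_bytes(prefixes):
    """Total byte span addressable by a branch with the given PFX count."""
    return 1 << (BRA_BITS + prefixes * PFX_BITS + ALIGN_SHIFT)
```

	branch_span_bytes(0) is 8K and branch_span_bytes(1) is 32 Meg, i.e.
the 13 + 12 bits in the quoted paragraph.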

>	cycle-expansion: .012 * 2 = .024

	Net result: .012 * 1 = .012

>A9*.	.010 <= Load/store hazard
>	The RPM40 is supposed to have a load/store hazard if a store occurs
>	at the exact right number of cycles after a load.  This is not data
>	we keep (since we don't have this hazard), and it's hard to compute.
>	Here's a quick guess:

	It's a bit weirder than you suppose, but the number sounds OK.

>A12*.	.033 <= Misses due to branch-target-cache misse
>	According to the talk, there was supposed to be a hit-rate >90%.
>	I have no data on the kinds of programs used to calibrate that.
>	Big programs are clearly worse than little programs in this regard;

	Our data comes from BIG VAX application programs, if I remember
correctly.  (BIG!)
	With small programs it approaches 99%.
	I'm assuming this was talked about at ISSCC, since the cache in general
was: with a good optimizer/reorganizer, you can override the hardware control
of replacement of targets, and take over <some, all> of the targets for 
whatever addresses you like.  What this actually gets you in practice is a
good question.

	MAJOR BUG:
	Your number left something out: you must subtract the R2000's own
cache miss rate times miss penalty from the RPM40's.
	I'll assume it all evens out (we MAY win, may not).

>A13*.	.050 <= Lack of ALU forwarding
>	From all of the discussion, I can't tell how much bypassing the
>	RPM40 does, or doesn't do.  From the various hints, it sounds like:
>	a) you can store the output of an ALU op with no delay
>	b) you can't otherwise use the result of an immediately-preceding
>	ALU op.

	Correct.

>	I'll guess 5%, which I think is reasonably conservative:
>	cycle expansion: .05

	Seems OK.  As you said, prefixes help here.

>A14*.	.020 <= Multiply-divide
>	I'm not exactly sure how this is implemented on the RPM,
>	although it probably doesn't have a fast multiplier on the CPU.
>	(Maybe it's on the FPU, in which case some of this would go away).

	I won't say how we do it (yet), but 32x32 multiplies are 15 or 16
(can't remember exactly) cycles.  Divides take longer, but are about
equivalent to the R2000's.

>	R2000 data: across the benchmarks, about 2.7% of the time was
>	spent in multiply/divide interlock cycles [we have a 12/35 cycle
>	multiple/divider], which includes the effects of having some
>	instructions scheduled into the latency of an asynchronous unit.
>	Assume that the interlock cycles are split 50/50 (faster mults,
>	but more mults than divides, and that RPM's mults take about 3X,
>	i.e., +2 factor, you get:
>	cycle expansion: .013 * 2 = .026, which I'll round down to 2.

	So this drops to just a little.
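
	One way to see how little, sketched in Python.  This reuses the quoted
.013 multiply share and the 12-cycle R2000 multiply, and assumes the
interlock time scales linearly with multiply latency, which is a
simplification of mine and not something measured.

```python
MULT_SHARE = 0.013       # the multiply half of the quoted 2.7% interlock time
R2000_MULT_CYCLES = 12   # from the quoted "12/35 cycle" figure

def mult_expansion(rpm_cycles):
    """Extra cycle fraction if only multiply latency differs (linear model)."""
    return MULT_SHARE * (rpm_cycles - R2000_MULT_CYCLES) / R2000_MULT_CYCLES
```

	mult_expansion(36) reproduces the quoted .026 for a 3X multiplier;
15 or 16 cycles give roughly .003 to .004 under the same model.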

>A15*.+	.040 <= 2-address registers, rather than 3-address ones

	Sounds OK, tough to figure numbers on (basic architecture diff).

>A16*.	.039 <= Less registers (21 instead of 32)
>	Note: this is more of an effect for large, complex programs than
>	small ones.  A cross-check is that you get the same number if
>	you assume even 1 more register is saved/restored on the average per
>	function call, (PFX+SW, PFX+LW for 4 cycles), with an average
>	of 100 instructions/call (representative).

	This also seems reasonable.

>A17*.	.020 <= Architectural Reorganization issues
>	Several of the times above related to reorganization:(A1, A9, A13).
>	A number of factors appear to make the RPM40 more difficult to
>	reorganize for:
>	1) PFX instructions are difficult, if not impossible to move
>	around away from the instructions prefixed (unlike the R2000's
>	style of using a bypassed GP register).

	True, though they do help a bit in filling load delays.

>	2) The instruction that sets the SR2 to get partial-word operations
>	is hard to move "too far" away from the instruction(s) that need it.

	Also true; luckily most subroutines only use one of {signed,unsigned}
{halfword,byte}, if they use any at all.  Also, global optimizers allow
better knowledge of what subroutine calls do to SR2.

>	3) The load/store pipeline hazard must be taken care of.

	True, but it's easy.  In practice it doesn't hurt much.

>	4) If there is no forwarding, that has to be reorganized also.
>	Reorganization is very important for many RISC processors.
>	The RPM40 has a some extra things to worry about, and one less
>	(R2000 branch delay slot).  I'd guess the overall hit to be 2%.

	Huh?  We have a two-cycle branch delay to fill (did you think we
had none???)  Experience on RPM40 shows it can be filled fairly well (good
place for stores, for example.)

>A18*.	?? <= Coprocessor issues
>	I haven't really touched on this very much, as we don't know much,
>	except that even a few cycles extra latency getting to a floating-point
>	unit can hurt a lot, except in applications that naturally pipeline
>	very well, or if the FPU has long cycle-count operations in the first
>	place.  XPLD without XPST is a little puzzling. 

	Note that load = approx 2x store frequency.  Should be obvious (and
there are other ways to load/store XP stuff.)  Suffice it to say, given the
environment we designed for, XP stuff was VERY carefully designed.  There
should be no loss, and probably even a gain, vs R2000.

>A19*.	-.067 <= Contraction issue (R2000 branch nops)
>	The R2000 loses 6-7% to unfilled branch delay slots,
>	which the RPM40 does not. (of course, the RPM40 takes hits in other
>	areas of branching, but we've already included them).

	Drop this (though we do get better than average stats for branch
fills, I predict, especially on CALL's).

>A20*	.010 <= Miscellaneous
>	There are a bunch of integer-related issues that I can only guess
>	at, but observing that there are 4 bits in the opcode field for
>	ALU ops, (not the R2000's 5), I'd guess that not all of the R2000's
>	ops are found in the RPM, although I don't know which ones they
>	might be.  Also, if the immediate field encodes 16-bit shifts,
>	that will help, and hurt, if not.

	Actually, we had leftovers we had to figure out what to use for.
(one became RADD (I think this was my idea; Dennis, do you remember?))
	Shifts/rotates are like all other ALU ops re: immediates.

>Bottom line, given everything I know:

	I'll show results modified re: the above comments.

>A1.	.336 <= 3 cycles of load latency @
	.300 guess
>A2.+	.279 <= Loads/stores use 4-bit immediates @
	.240 guessed at % that could use a 6 bit immediate (effective)
>A3.+	.035 <= Add/sub immediates @
>A4.+	.013 <= Compare immediates @
>A5.+	.013 <= Logical immediates @
>A6.+	.011 <= Load-immediate @
>A7.+	.018 <= Load-upper-immediate @
	.001 for a process smaller than 28 bits of instructions, we don't
	     ever need to load the top 4 bits for address constants.  For
	     integer constants, if the top 5 are all 1 or 0, we don't need the
	     top 4 (almost always the case).
>A8.+	.024 <= Jump-and-link @
	.012 see above
>A9*.	.010 <= Load/store hazard
	.005 I think we can do better at avoiding the problem, loads tend to
	     cluster near beginnings of block, stores at the end.  Not big
	     either way.
>A10*.+	.098 <= Conditional branch
>A11*.+	.010 <= Partial-word load/store @
>A12*.	.033 <= Misses due to branch-target-cache misse
	.000 mistake - R2000 has misses too.
>A13*.	.050 <= Lack of ALU forwarding
>A14*.	.020 <= Multiply-divide
	.005 See above
>A15*.+	.040 <= 2-address registers, rather than 3-address ones
>A16*.+	.039 <= Less registers (21 instead of 32, sort of 
>A17*.	.020 <= Architectural Reorganization issues
>A18*.	?? <= Coprocessor issues
>A19*.	-.067 <= Contraction issue (R2000 branch nops)
	.000 Misconception
>A20*	.010 <= Miscellaneous
	.000 I don't really think we lose anything significant here
>
>Total	0.992	cycle expansion

Amusing: I came out with (after my mods above):
	0.892
Pretty close to what you have, all in all.
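
	For anyone who wants to re-add the columns, a short Python sketch.
The figures are copied from the table above and kept in thousandths to
avoid floating-point noise; the dictionary and function names are mine.

```python
# All values in thousandths of a cycle, copied from the table above.
MASHEY = {
    "A1": 336, "A2": 279, "A3": 35, "A4": 13, "A5": 13, "A6": 11,
    "A7": 18, "A8": 24, "A9": 10, "A10": 98, "A11": 10, "A12": 33,
    "A13": 50, "A14": 20, "A15": 40, "A16": 39, "A17": 20,
    "A19": -67, "A20": 10,   # A18 is "??" and contributes nothing
}
# Only the items changed in the reply.
REVISED = {"A1": 300, "A2": 240, "A7": 1, "A8": 12, "A9": 5,
           "A12": 0, "A14": 5, "A19": 0, "A20": 0}

def total(overrides=None):
    """Sum the cycle-expansion items, applying any overrides."""
    items = dict(MASHEY)
    items.update(overrides or {})
    return sum(items.values()) / 1000
```

	total() gives 0.992 and total(REVISED) gives 0.892, so the RPM40
spends about 1.9 cycles for each R2000 cycle.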

>	Thus, for cycle counts (ignoring cache-miss & MMU overhead),
>	a 40MHz RPM would act more-or-less like a 20MHz R2000, i.e.,
>	it would run twice as many (instruction cycles + delay cycles).

	Effectively, I get close to the same (1.9 RPM cycles = 1 R2000 cycle).

>	20 / 1.39 = 14.4 VUP
>
>which is well inside  "7-9X a 16.7MHz 68020 or 2-6X a Sun-3/260" estimates
>given by various of the GE folks.

	Wow.  I guess we are pretty good guessers (actually, the numbers
came from things like this, but against 68000, 1750, etc, etc.)

>So far, all of this has been architectural, i.e., assuming that the
>RPM40's software was as close to the R2000's as possible, i.e.,
>what would it be if they had our compilers. It is hard to compare
>against something you don't know, but I would observe:
[stuff about the good MIPS compilers, estimates that equals 25% loss to rpm40]

	Well, since "GE is not in the computer business", I doubt we'll
ever know for sure.  But I'm not interested in what GE/USG does with it,
just what the architecture wins or loses.

>4. Conclusion

>This analysis is only relevant to running substantial programs
>of the kinds found on general-purpose machines.  The RPM may well
>do relatively better in more embedded-systems environments where the
>tradeoffs work better, and there is nothing wrong/irrelevant with those
>environments; they just aren't workstation environments, and whatever
>one learns in either one doesn't necessarily translate to the other,
>BECAUSE THE KINDS OF PROGRAMS YOU'RE RUNNING MAY HAVE DIFFERENT STATISTICS.

	Yup!  I suspect that for the instruction mixes we optimized for,
we do somewhat better than the 1.9x I arrived at vs R2000 (BIG programs, but
different types of BIG programs (hint hint)).  Maybe much better (1x).

>I apologize for any errors in this analysis, which took a fair
>amount of time to put together, given the sketchiness of the info,
>but which, I believe, is accurate to within the correctness
>of the assumptions, and does reflect a modest amount of experience
>with such things.  Feel free to fix it if it's wrong, and if it matters,
>but as Mr. Jesup says, this has been beaten to death.

	The errors were minor, and helped to correct a few misconceptions
about what the RPM-40 instruction set does (ex: branches and CALL).  I
thank you for all the work you did on this; this is the type of thing I
read comp.arch for.  It is pretty definitive, no need to quibble over
minor points.  As I said in a previous message, we didn't optimize for
workstations.  We can do them, but it's a toss-up (confirmed by this
article) whether we're better in a workstation environment than the R2000.
Of course, in different cases we do better (even large cases, just different
problems/instruction mixes).  No surprise.  The R2000 does very well at
what it was designed for, Unix boxes and workstations, and we read whatever
articles you had published when we were designing the RPM-40.

>-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
>UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
>DDD:  	408-991-0253 or 408-720-1700, x253
>USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)