Path: utzoo!mnetor!uunet!husc6!cmcl2!nrl-cmf!ames!lll-lcc!pyramid!prls!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Architectural analysis of RPM-40 for general usage [very long]
Message-ID: <1878@winchester.mips.COM>
Date: 16 Mar 88 04:35:13 GMT
References: <1840@winchester.mips.COM> <514@imagine.PAWL.RPI.EDU>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 238
Keywords: benchmarks architecture RISC

In article <514@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
...
>	And I thank you for it.  Numbers/real comparison stuff is great,
>"my chip is better than yours" stuff get old fast.  I'll keep my comments
>to a minimum, given the length of the original posting.

Yes, this is much more useful.  Thanx for the clarifications.  This posting
mostly responds to your comments on points where I wasn't sufficiently
clear in the first posting.

>>A1.	.336 <= 3 cycles of load latency
..
>	Here another effect comes into play: two-address ALU instructions.  When
>the compiler wants 3-address, it has to generate 2 ALU instructions.  This
>means a slightly higher percentage of ALU ops to load/store ops, and therefore
>better filling of load delays.  That should increase the 30% and 10% a fair
>amount, but I don't know how much.  Also, ALU ops with prefixes will help
>here too, filling otherwise useless cycles.
Yes, this should help a little. The specific cases are the R2000 ones where
an lnop (a nop filling a load delay slot) is immediately followed by:
	a) a 3-register ALU op
	b) an ALU op that uses an immediate >4 bits.
I have no data on how often those occur.  From looking at code,
it is sad but true that many lnops are structurally there almost no
matter what the architecture and compiler system are (with the possible
exception of the VLIW systems).  I.e., they are the things that result from
code like:
	if ((p != NULL) && (p->thing)) p = p->nextlink;
which gives code like:
	lw	reg1,p
	nop
	beq	reg1,0,1f
	nop
	lw	reg2,thingoffset(reg1)
	nop
	beq	reg2,0,1f
	nop
	lw	reg1,nextlink(reg1)
	nop
1:
	Maybe you get to fill the branch delays by pulling in the instruction
	at 1: (in fact, we often find a LUI or LI there and move it),
	but after looking at a lot of code, I can't for the life
	of me figure out how you frequently get to fill most of the load-nops,
	in an R2000, an RPM-40, or most other RISCs I've seen...UNIX kernel
	code is just filled with this kind of thing.
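
	The structural nature of these lnops can be shown with a toy model.
	This is only a sketch: it assumes a one-cycle load delay, ignores
	branch delay slots, and uses a made-up instruction tuple format
	(opcode, destination, sources) that is not from any real tool.

```python
# Toy model of a one-cycle load delay: a load's result cannot be read by
# the very next instruction, so a nop is needed unless an independent
# instruction can be placed in between.
# Each instruction is (opcode, destination register or None, source registers).

def count_load_nops(instrs):
    """Count loads whose result is read by the immediately following instruction."""
    nops = 0
    for cur, nxt in zip(instrs, instrs[1:]):
        op, dest, _ = cur
        _, _, next_srcs = nxt
        if op == "lw" and dest in next_srcs:
            nops += 1
    return nops

# The pointer-chasing sequence from the text, with nops and branch delay
# slots left out.
pointer_chase = [
    ("lw",  "reg1", ["p"]),      # reg1 = p
    ("beq", None,   ["reg1"]),   # if (p == NULL) goto 1
    ("lw",  "reg2", ["reg1"]),   # reg2 = p->thing
    ("beq", None,   ["reg2"]),   # if (p->thing == 0) goto 1
    ("lw",  "reg1", ["reg1"]),   # p = p->nextlink
]

print(count_load_nops(pointer_chase))
```

	The model reports 2 for this sequence: each of the first two loads
	feeds the branch right behind it, and there is no independent
	instruction to hoist in between.  The nop after the last load
	depends on what follows label 1:, which the model does not see.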

>>A2.+	.279 <= Loads/stores use 4-bit immediates

>	Note:  for word load/stores, that 4 bit immediate is shifted left 2,
>effectively 6 bits in byte addressing.  Halfwords (minor) get shift left 1.
>Also, what about >16 bit offsets?  Obviously, the R2000 will do a mov; add
>immediate 32-bit; load sequence (well, a guess), so it's hard to compare here.
>Luckily, >16 bit offsets are rare (note the shifting comes into play again
>here.)
(Actually, what we do is:  lui reg1,address>>16; lw reg2,(address&0xffff)(reg1).
There was an article in ASPLOS this summer that talked about addressing).
I already counted this in the LUI analysis that came later (A7), i.e., I had no
way to disentangle why LUIs were there, and only have the data on
an instruction-by-instruction basis for the offsets.  The extra 2 bits for
words definitely help (especially in non-stack, non-GP references),
so I'd buy the .240.
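
The lui/lw pair can be sketched as follows.  One detail is an assumption
added here rather than something stated above: because the 16-bit lw offset
is sign-extended, the upper half has to be bumped by one whenever bit 15 of
the lower half is set.  The function names are hypothetical.

```python
def split_address(addr):
    """Split a 32-bit address into (hi, lo) halves for a lui / lw pair."""
    lo = addr & 0xFFFF
    # The lw offset is sign-extended, so compensate in the upper half
    # when bit 15 of the lower half is set.
    hi = ((addr + 0x8000) >> 16) & 0xFFFF
    return hi, lo

def sign_extend16(x):
    """Interpret a 16-bit field as a signed offset."""
    return x - 0x10000 if x & 0x8000 else x

def effective_address(hi, lo):
    """Address formed by the pair: (hi << 16) + sign_extend(lo), modulo 2**32."""
    return ((hi << 16) + sign_extend16(lo)) & 0xFFFFFFFF

# Round-trip check over a few addresses, including ones with bit 15 set.
for addr in (0x00001234, 0x1000FFFC, 0x7FFF8000, 0xFFFFFFFF):
    hi, lo = split_address(addr)
    assert effective_address(hi, lo) == addr
```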

>>A3.+	.035 <= Add/sub immediates [.050]
>>		The 3.5% assumes there are both add/sub immediate (i.e.,
>>		4 bits + implied sign).  If there is really just add-immed,
>>		and you get 3 bits + sign, use the number 5%
>	I'm not sure I understand you, for reference the leftmost bit of the
>immediate is extended.  All ALU ops can use immediates.
That helps clarify it: the number should be .050.  The .035 included the
possibility that for add/subtract, you interpreted the 4-bit immediate as
a zero-extended (i.e., positive) number, in order to get 1 more bit of
immediate field, since the add/sub immediates overlap if you sign-extend
the immediate.  The .050 guess assumes you can add [-8..7] and subtract
[-8..7], or add/subtract [-8..8] by flipping add/sub as needed.  The .035
number assumed that you could get to [-15..15].
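
The two ranges can be checked by enumeration.  This is a minimal sketch in
which "reachable" means the net constants that a single add-immediate or
subtract-immediate can apply; the helper name is made up for illustration.

```python
def reachable(add_imms, sub_imms):
    """Net constants c such that reg + c is one add-immediate or one subtract-immediate."""
    return sorted(set(add_imms) | {-i for i in sub_imms})

signed4 = range(-8, 8)     # 4-bit field, sign-extended
unsigned4 = range(0, 16)   # 4-bit field, zero-extended

sign_ext = reachable(signed4, signed4)      # the .050 case
zero_ext = reachable(unsigned4, unsigned4)  # the .035 case

print(sign_ext[0], sign_ext[-1])
print(zero_ext[0], zero_ext[-1])
```

Sign extension gives [-8..8], since add and subtract overlap almost
entirely, while zero extension gives [-15..15].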

>	One NEVER needs more than 3.  3 PFX's + 4 bits in the instruction =
>40 bits.  2 PFX's + 4 bits = 28.  1 PFX + 4 bits = 16.  (this is for ALU ops
>and loads/stores....
Oops, I computed it right, but said it wrong: (I was thinking of the 4-bits
as a PFX)....
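
The prefix arithmetic quoted above can be written out as below.  Two
assumptions: each PFX contributes 12 bits (inferred from 16 = 4 + 12,
28 = 4 + 2*12, 40 = 4 + 3*12), and the combined immediate is sign-extended
from its leftmost bit.  The function name is hypothetical.

```python
IMM_BITS = 4    # immediate bits in the instruction itself
PFX_BITS = 12   # bits per PFX, inferred from the 16/28/40 figures

def pfx_needed(value):
    """Smallest number of PFX instructions for a sign-extended constant."""
    for n in range(4):
        width = IMM_BITS + n * PFX_BITS
        if -(1 << (width - 1)) <= value < (1 << (width - 1)):
            return n
    raise ValueError("constant does not fit in 40 bits")

for v in (7, 8, -8, -9, 32767, 32768, 2**27 - 1, 2**27, 2**31 - 1, -2**31):
    print(v, pfx_needed(v))
```

Under these assumptions any 32-bit constant fits in at most 3 prefixes,
and anything in the signed 16-bit range needs at most one.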

>>A8.	.024 <= Jump-and-link
>>	JAL is done in the RPM-40 by MOV (PC) someplace; BRA...
>	Close, no cigar.  BRA; <fill>; MOV or STW.  This helps fill branch
>delay, which otherwise might be tough (subroutine calls are worse than
>other types of branches, usually).

(Actually, JAL's fill pretty well with argument-prep instructions.)
Thanx for the clarification: am I still wrong to think that there
are 3 cycles (of actual work) needed for a large-size branch-and-link?
(Ignoring branch-delay slots for the moment).
...
>	For internal calls, it's often enough.  Do you have data on call
>distances?  I would suspect even in moderate sized programs it's at least
>10%, maybe as high as 25%; and in small utilities 25%+.
Call distances are something I don't have in the standard reports.
...
>	MAJOR BUG:
>	Your number left something out: you must subtract whatever the R2000's
>cache miss times penalty from the RPM40's.
> 	I'll assume it all equals out (we MAY win, may not).

(I included the generic cache-miss overhead in the factor that converted
cycles->VUPS, and assumed that the RPM was using SRAM as a cache,
as needed in a general-purpose environment. See the summary for more info.
The R2000's single branch-delay slot covers the time to access the I-cache.)
...
>>	2) The instruction that sets the SR2 to get partial-word operations
>>	is hard to move "too far" away from the instruction(s) that need it.
>
>	Also true, luckily most subroutines only use one of {signed,unsigned}
>{halfword,byte}, if they use any at all.  Also, global optimizers allow
>better knowledge of what subroutine calls do to Sr2.

Many user programs don't use halfwords.  Systems-type programs mix bytes
and halfwords a lot.  (experience, but no data, unfortunately).

>>	4) If there is no forwarding, that has to be reorganized also.
>>	Reorganization is very important for many RISC processors.
>>	The RPM40 has some extra things to worry about, and one less
>>	(R2000 branch delay slot).  I'd guess the overall hit to be 2%.

>	Huh?  We have a two-cycle branch delay to fill (did you think we
>had none???)  Experience on RPM40 shows it can be filled fairly well (good
>place for stores, for example.)
(I thought you might have one, but I didn't have any data.  What I've heard
(from Stanford MIPS-X) is that the 2nd branch-delay slot is fairly
hard to fill.  We agree that stores often go into the branch delay.)

As can be seen, reorganization statistics are nontrivial to estimate!
...
>	Note that load = approx 2x store frequency.  Should be obvious (and
>there are other ways to load/store XP stuff.)  Suffice it to say, given the
>environment we designed for, XP stuff was VERY carefully designed.  There
>should be no loss, and probably even a gain, vs R2000.

(Possible: we'll know when we see it, although the R2000/R2010 is a very
low latency design, with the coprocessor watching the instruction stream,
and doing direct loads/stores at the right times.)

>>A20*	.010 <= Miscellaneous
>>	There are a bunch of integer-related issues that I can only guess
>>	at, but observing that there are 4 bits in the opcode field for
>>	ALU ops, (not the R2000's 5), I'd guess that not all of the R2000's
>>	ops are found in the RPM, although I don't know which ones they
>>	might be.  Also, if the immediate field encodes 16-bit shifts,
>>	that will help, and hurt, if not.
>
>	Actually, we had leftovers we had to figure out what to use for.
>(one became RADD (I think this was my idea; Dennis, do you remember?))
>	Shifts/rotates are like all other ALU ops re: immediates.

What I meant was: can you do left-shift 16 or right-shift 16 in a single
cycle, with no PFX (depends on how you interpret the immediate field)?
Not that it matters a lot, but it probably goes up to .005 if you can do
shifts by 15, but not 16, in one cycle.  Rather than wasting space on
tortuous reasoning about the opcode field, I'll wait until the opcode
list can be published.

>>A1.	.336 <= 3 cycles of load latency @
>	.300 guess [OK]
>>A2.+	.279 <= Loads/stores use 4-bit immediates @
>	.240 guessed at % that could use a 6 bit immediate (effective) [OK]
>>A3.+	.050 <= Add/sub immediates @
>>A4.+	.013 <= Compare immediates @
>>A5.+	.013 <= Logical immediates @
>>A6.+	.011 <= Load-immediate @
>>A7.+	.018 <= Load-upper-immediate @
>	.001 for a process smaller than 28 bits of instructions, we don't
>	     ever need to load the top 4 bits for address constants.  For
>	     integer constants, if the top 5 are all 1 or 0, we don't need the
>	     top 4 (almost always the case). 
	Actually, almost all of these are for data addresses, or logical
	constants/masks, etc. They're almost never for instructions, so I'd
	stick with the original number: .018.
>>A8.+	.024 <= Jump-and-link @
>	.012 see above
	I still didn't quite understand the change here, as I still think
	there are 3 cycles (PFX;BRA;fill;MOV)  where the R2000 would have
	(JAL;fill), with both architectures having the same sorts of things
	moving into the fill.  There is room for discussion on the fill issue,
	and on whether you could use short branches in practice. Try:
	.020

>>A9*.	.010 <= Load/store hazard
>	.005 I think we can do better at avoiding the problem, loads tend to
>	     cluster near beginnings of block, stores at the end.  Not big
>	     either way.
		OK, hard to guess.
>>A10*.+	.098 <= Conditional branch
>>A11*.+	.010 <= Partial-word load/store @
>>A12*.	.033 <= Misses due to branch-target-cache misses
>	.000 mistake - R2000 has misses too.
	Actually not, or rather: the R2000 has cache misses in its external
	cache; the RPM has cache misses in its internal cache + misses in
	the external cache (this was assuming a design where the 64KI+64KD
	memories were caches, rather than the only memory).  The .033 here was
	the EXTRA penalty for taking internal cache misses (which might well
	be external cache hits).  I subsumed all of the external cache missing
	into the cycles->VUP conversion, since I didn't have a better way
	to get at it, and all of the cycle counts so far were independent
	of external cache designs. (Ask again if this doesn't make sense).
	As noted, the .033 number depended on the 90% rate, and would go up
	or down depending on the environment, but the number definitely is
	not zero, unless I misunderstand how the RPM works.
	.033

>>A13*.	.050 <= Lack of ALU forwarding
>>A14*.	.020 <= Multiply-divide
>	.005 See above
		OK, I can buy this.
>>A15*.+	.040 <= 2-address registers, rather than 3-address ones
>>A16*.+	.039 <= Less registers (21 instead of 32, sort of 
>>A17*.	.020 <= Architectural Reorganization issues
>>A18*.	?? <= Coprocessor issues
>>A19*.	-.067 <= Contraction issue (R2000 branch nops)
>	.000 Misconception
		OK.
>>A20*	.010 <= Miscellaneous
>	.000 I don't really think we lose anything significant here
	[OK, if you can do shift left/right 16 in 1 cycle, else .005].
>>Total	0.992	cycle expansion
>Amusing: I came out with (after my mods above):
>	0.892
>Pretty close to what you have, all in all.

Your revisions, plus my revisions to your revisions, come out at .965.
My guess is that, unless there's some fundamental misunderstanding left,
there is probably 90% confidence that most programs of the sorts
modeled would lie in the range 0.9 - 1.0 (to 1 sig. digit!).
Thanx for the corrections; I think A8 & A12 are the only spots that
might need more clarification, if this last wasn't right.
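
As a cross-check, the .965 figure can be reproduced by re-adding the
per-item values from the table above, taking the later number wherever an
item was revised.  Two assumptions in this sketch: the "??" for A18 is
counted as zero, and A20 is taken as .000 (i.e., a shift by 16 takes one
cycle).

```python
# Revised per-item cycle-expansion estimates, read off the table above.
revised = {
    "A1": 0.300, "A2": 0.240, "A3": 0.050, "A4": 0.013, "A5": 0.013,
    "A6": 0.011, "A7": 0.018, "A8": 0.020, "A9": 0.005, "A10": 0.098,
    "A11": 0.010, "A12": 0.033, "A13": 0.050, "A14": 0.005, "A15": 0.040,
    "A16": 0.039, "A17": 0.020,
    "A18": 0.000,  # assumption: "??" (coprocessor issues) counted as zero
    "A19": 0.000,
    "A20": 0.000,  # assumption: shift by 16 takes one cycle
}

total = round(sum(revised.values()), 3)
print(total)
```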
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086