Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!lll-lcc!pyramid!prls!mips!mash
From: mash@mips.UUCP
Newsgroups: comp.arch
Subject: Re: String Processing Instruction
Message-ID: <236@winchester.mips.UUCP>
Date: Sat, 28-Mar-87 16:49:18 EST
Article-I.D.: winchest.236
Posted: Sat Mar 28 16:49:18 1987
Date-Received: Sun, 29-Mar-87 11:28:34 EST
References: <15292@amdcad.UUCP> <978@ames.UUCP> <909@spar.SPAR.SLB.COM> <230@winchester.mips.UUCP> <15315@amdcad.UUCP>
Reply-To: mash@winchester.UUCP (John Mashey)
Distribution: na
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 107

In article <15315@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>So, I would like to turn the question around, John, and ask you to do
>what we might find difficult:  look at the dynamic execution of your
>compiler.
Will do.  For the time being, I did find one program that used them
enough to be interesting: (4.3BSD /bin/ls /bin /usr/bin)
function	%CPU	cycles/call
strlen		7.35	18
strcat		3.24	57
strcmp		2.95	11

Again: strcmp's clearly fail pretty quickly, and strlens's don't look at
very big strings [later, I'll show the dis-assembled code for the
strlen & strcmp functions (the others are a bit longer).

For shorter ls (like ls of current directory with 10 files), or for SYS V ls,
the numbers were down in the noise. GIANT CAVEAT: remember these are the
undegraded (for cache/tlb/memory interference) numbers, and even
more importantly, remember that these are for our architecture,
i.e., my previous estimates are for how much WE would save by ahving
the instruction.  Note that any architectural addition must always be
evaluated in the context of what's already in the architecture, i.e.,
it's not possible to SAY "such and such instruction is worth having"
without considering what it's being added onto.
For example, Brian's first posting on this topic didn't mention that the
AMD was a WORD-ADDRESSED machine, that comment was later.  It wouldn't
surprise me at all that the special string instruction is quite worthwhile
for such machines, even if it's probably [from the evidence I've got
so far] below the cutoff on ours [although still useful enough not to
reject without thought.]

>Sigh, it may be that there are just a few important programs that will
>benefit.  Out of the 100 most popular programs, maybe two benefit sub-
>stantially from the compare-bytes instruction.  But if those two are the
>most popular (or close to it), it might still be worthwhile to have the
>instruction.  Dhrystone is either important or not depending upon how
>much time you spend running benchmarks or how paranoid one is. :-)
As noted: the instruction is almost certainly worth something on the AMD part.
[The word-addressing thing is a whole separate issue that's worth a LOT
of discussion, but we might as well finish this one first.]
As usual, computer architecture needs a great deal of measurement,
because intuition can often be wrong.  You still need intuition and
experience for the stuff that's so hard to measure.
>
>Ok, so now Tim tells me that the assembler spends about 17% of its time
>in strcmp() (strcpy() is less than 1% in this case).  And note that this
>17% is *including* the effect of our compare-bytes instruction, without
Is it possible for you to post the code for strcmp/strlen?
(It would be especially interesting to see how you handle pointers
to bytes & shorts).

Rest of this is details of strlen/strcmp mentioned above.
	strlen:
0x404000:	addiu	v1,a0,1
0x404004:	lbu	v0,0(a0)
0x404008:	addiu	a0,a0,1
0x40400c:	bne	v0,zero,0x404004
0x404010:	nop
0x404014:	jr	ra
0x404018:	subu	v0,a0,v1
In ls, this was 7.35%, 18 cycles/call.  This says that the mean case
is addiu, 4 times around the loop (4 instrs each), plus the 2 instrs
(jr, subu) on exit. I.e., the general cost for this routine is 3 + 4N
(where N is the size).  All of this implies what the size distribution must
be: highly skewed, with a mode of 1-2, and a long tail towards longer
strings.  Thus, the tradeoff is between a routine with minimal setup that
runs quickly on short strings, but is not as efficient on longer strings,
versus ones that uses more setup. [As you can see, proper design of all
this REALLY depends on knowing the statistics: we wrote several versions
of each of these for comparison.  Again, I wish somebody would do
a definitive study!]

	strcmp:
0x403f50:	lbu	t0,0(a0)				|
0x403f54:	lbu	t1,0(a1)				|
0x403f58:	beq	t0,zero,0x403f84			|-> null str, ex
0x403f5c:	addi	a0,a0,2				|
0x403f60:	bne	t0,t1,0x403f8c			|-> 1st char !=, exit
0x403f64:	lbu	t2,-1(a0)		|
0x403f68:	lbu	t1,1(a1)		|
0x403f6c:	beq	t2,zero,0x403f84	|-> 2nd char null, exit
0x403f70:	addi	a1,a1,2		|
0x403f74:	beq	t2,t1,0x403f54	|---^ characters equal, loop
0x403f78:	lbu	t0,0(a0)
0x403f7c:	jr	ra
0x403f80:	subu	v0,t2,t1
0x403f84:	jr	ra
0x403f88:	subu	v0,zero,t1
0x403f8c:	jr	ra
0x403f90:	subu	v0,t0,t1

This is a 2-way unrolled version of the simplest strcmp.  In ls,
this was 2.95% of the instruction cycles, 11 cycles/call average.
There seem to be 2 "likely" cases here: 
	11 instrs: take the "2nd char null, exit" branch and return
	13 instrs: find 2nd char !=, i.e., execute 13 instrs with no branch
What this says is that once again, there's a very skewed distribution,
since some longer strings must compare equal. [To be honest, this
even surprises me somewhat].

IF this is representative, this says that the special instruction
wouldn't do US much good on these routines. [It might help on
strcpy/strcat: I'll look at that later.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086