Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!lll-lcc!pyramid!prls!mips!mash From: mash@mips.UUCP Newsgroups: comp.arch Subject: Re: String Processing Instruction Message-ID: <236@winchester.mips.UUCP> Date: Sat, 28-Mar-87 16:49:18 EST Article-I.D.: winchest.236 Posted: Sat Mar 28 16:49:18 1987 Date-Received: Sun, 29-Mar-87 11:28:34 EST References: <15292@amdcad.UUCP> <978@ames.UUCP> <909@spar.SPAR.SLB.COM> <230@winchester.mips.UUCP> <15315@amdcad.UUCP> Reply-To: mash@winchester.UUCP (John Mashey) Distribution: na Organization: MIPS Computer Systems, Sunnyvale, CA Lines: 107 In article <15315@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes: >So, I would like to turn the question around, John, and ask you to do >what we might find difficult: look at the dynamic execution of your >compiler. Will do. For the time being, I did find one program that used them enough to be interesting: (4.3BSD /bin/ls /bin /usr/bin) function %CPU cycles/call strlen 7.35 18 strcat 3.24 57 strcmp 2.95 11 Again: strcmp's clearly fail pretty quickly, and strlens's don't look at very big strings [later, I'll show the dis-assembled code for the strlen & strcmp functions (the others are a bit longer). For shorter ls (like ls of current directory with 10 files), or for SYS V ls, the numbers were down in the noise. GIANT CAVEAT: remember these are the undegraded (for cache/tlb/memory interference) numbers, and even more importantly, remember that these are for our architecture, i.e., my previous estimates are for how much WE would save by ahving the instruction. Note that any architectural addition must always be evaluated in the context of what's already in the architecture, i.e., it's not possible to SAY "such and such instruction is worth having" without considering what it's being added onto. For example, Brian's first posting on this topic didn't mention that the AMD was a WORD-ADDRESSED machine, that comment was later. It wouldn't surprise me at all that the special string instruction is quite worthwhile for such machines, even if it's probably [from the evidence I've got so far] below the cutoff on ours [although still useful enough not to reject without thought.] >Sigh, it may be that there are just a few important programs that will >benefit. Out of the 100 most popular programs, maybe two benefit sub- >stantially from the compare-bytes instruction. But if those two are the >most popular (or close to it), it might still be worthwhile to have the >instruction. Dhrystone is either important or not depending upon how >much time you spend running benchmarks or how paranoid one is. :-) As noted: the instruction is almost certainly worth something on the AMD part. [The word-addressing thing is a whole separate issue that's worth a LOT of discussion, but we might as well finish this one first.] As usual, computer architecture needs a great deal of measurement, because intuition can often be wrong. You still need intuition and experience for the stuff that's so hard to measure. > >Ok, so now Tim tells me that the assembler spends about 17% of its time >in strcmp() (strcpy() is less than 1% in this case). And note that this >17% is *including* the effect of our compare-bytes instruction, without Is it possible for you to post the code for strcmp/strlen? (It would be especially interesting to see how you handle pointers to bytes & shorts). Rest of this is details of strlen/strcmp mentioned above. strlen: 0x404000: addiu v1,a0,1 0x404004: lbu v0,0(a0) 0x404008: addiu a0,a0,1 0x40400c: bne v0,zero,0x404004 0x404010: nop 0x404014: jr ra 0x404018: subu v0,a0,v1 In ls, this was 7.35%, 18 cycles/call. This says that the mean case is addiu, 4 times around the loop (4 instrs each), plus the 2 instrs (jr, subu) on exit. I.e., the general cost for this routine is 3 + 4N (where N is the size). All of this implies what the size distribution must be: highly skewed, with a mode of 1-2, and a long tail towards longer strings. Thus, the tradeoff is between a routine with minimal setup that runs quickly on short strings, but is not as efficient on longer strings, versus ones that uses more setup. [As you can see, proper design of all this REALLY depends on knowing the statistics: we wrote several versions of each of these for comparison. Again, I wish somebody would do a definitive study!] strcmp: 0x403f50: lbu t0,0(a0) | 0x403f54: lbu t1,0(a1) | 0x403f58: beq t0,zero,0x403f84 |-> null str, ex 0x403f5c: addi a0,a0,2 | 0x403f60: bne t0,t1,0x403f8c |-> 1st char !=, exit 0x403f64: lbu t2,-1(a0) | 0x403f68: lbu t1,1(a1) | 0x403f6c: beq t2,zero,0x403f84 |-> 2nd char null, exit 0x403f70: addi a1,a1,2 | 0x403f74: beq t2,t1,0x403f54 |---^ characters equal, loop 0x403f78: lbu t0,0(a0) 0x403f7c: jr ra 0x403f80: subu v0,t2,t1 0x403f84: jr ra 0x403f88: subu v0,zero,t1 0x403f8c: jr ra 0x403f90: subu v0,t0,t1 This is a 2-way unrolled version of the simplest strcmp. In ls, this was 2.95% of the instruction cycles, 11 cycles/call average. There seem to be 2 "likely" cases here: 11 instrs: take the "2nd char null, exit" branch and return 13 instrs: find 2nd char !=, i.e., execute 13 instrs with no branch What this says is that once again, there's a very skewed distribution, since some longer strings must compare equal. [To be honest, this even surprises me somewhat]. IF this is representative, this says that the special instruction wouldn't do US much good on these routines. [It might help on strcpy/strcat: I'll look at that later.] -- -john mashey DISCLAIMER: UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086