Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!seismo!ll-xn!ames!amdcad!bcase
From: bcase@amdcad.UUCP
Newsgroups: comp.arch,comp.lang.c
Subject: Re: String Processing Instruction
Message-ID: <15313@amdcad.UUCP>
Date: Fri, 27-Mar-87 18:01:01 EST
Article-I.D.: amdcad.15313
Posted: Fri Mar 27 18:01:01 1987
Date-Received: Sat, 28-Mar-87 16:35:53 EST
References: <15292@amdcad.UUCP> <1001@ames.UUCP>
Reply-To: bcase@amdcad.UUCP (Brian Case)
Organization: Advanced Micro Devices, Sunnyvale, California
Lines: 68
Xref: utgpu comp.arch:684 comp.lang.c:1358

In article <1001@ames.UUCP> jaw@ames.UUCP (James A. Woods) writes:
>just curious which unix utilities use str(cpy|cmp|len) in their inner loops?
>certainly, 'vn' comes to mind as devoting much cpu time to these functions.
>
>is the 15-20% claimed due at all to either
>
>    (a) significantly slow byte addressing to begin with (ala cray)?
>    (b) in-line compilation of the string(3) stuff into the application?
>
>if (a), then improving memory byte access speed in the architecture is a
>more general solution with more payoff overall than the compare gate hack.
>what is the risc chip cost for byte vs. word addressibility, anyway?
>
>if (b), then maybe function call speed is the culprit rather than dearth of
>the specialized instruction.
>
>at any rate,
>for cray unix, buffer copy ops in the kernel were vastly improved when
>re-written for words instead of bytes, even more so when vectorized
>(the only place in the kernel with vectorization, i think).
>
>of course, table lookup using only 2^16 locations would be a joke
>software solution for super-intensive null-char-in-16-bit-smallword
>compare code. drastic, but saves a test the amd chip appears worried about.
>personally, i'm a fan of branch folding ...
>

Re: (a) above. NO. Implementing a byte addressable (especially writable)
memory slows down all memory accesses for a slight improvement in byte
processing efficiency.
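
To make the trade-off concrete, here is a rough C sketch of what byte
access costs when memory can only be read and written a whole 32-bit
word at a time. It is an illustration only, not Am29000 code: the helper
names load_byte and store_byte are made up, and the byte numbering
(byte 0 in the most significant position) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read one byte through a word-only memory.
 * Assumes byte 0 of each word is the most significant byte. */
static uint8_t load_byte(const uint32_t *mem, uint32_t addr)
{
    assert(mem != NULL);
    uint32_t word = mem[addr >> 2];            /* ordinary word load */
    unsigned shift = (3u - (addr & 3u)) * 8u;  /* position of the byte */
    return (uint8_t)((word >> shift) & 0xFFu); /* extract step */
}

/* Hypothetical helper: write one byte through a word-only memory.
 * This is a read-modify-write of the containing word. */
static void store_byte(uint32_t *mem, uint32_t addr, uint8_t value)
{
    assert(mem != NULL);
    unsigned shift = (3u - (addr & 3u)) * 8u;
    uint32_t word = mem[addr >> 2];            /* read the word */
    word &= ~((uint32_t)0xFFu << shift);       /* clear the old byte */
    word |= (uint32_t)value << shift;          /* insert the new byte */
    mem[addr >> 2] = word;                     /* write the word back */
}
```

The shift-and-mask work appears only in these two helpers. A plain word
access is just mem[i] and pays nothing extra, which is the point of
keeping the alignment step out of the memory path.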
For my own part, I can say that from the beginning of the Am29000
project, I was firmly against anything but a word-oriented interface.
You have to realize that byte accessing requires an alignment network
somewhere: it will add some nanoseconds to all memory accesses. You can
put the alignment network in its own pipeline stage, but even then, it
will *always* slow down every memory access; there is nothing you can do
about it! (The same reasoning leads to our addressing-mode-less load and
store instructions: if addressing modes are there, then the instructions
always pay the cost even when the compiler knows better!) Thus, this is
not strictly a RISC issue, but a maximum throughput issue regardless of
RISCness. (Note that the Am29000 extract-byte and insert-byte
instructions essentially put the alignment network in a pipeline stage,
but the compiler can decide whether or not to pay the penalty for a
given memory access, depending upon whether it is a byte access.)

Re: (b) above. NO. We do not do inlining of the strcmp(), strcpy(), or
strlen() routines (I wish we could, it would be even better!). The
Am29000 has one of the fastest calling conventions around. The
performance improvement we saw with the special instruction is an order
of magnitude (ok, I'm taking a guess here, but it is probably pretty
close) greater than what would have been gained by in-lining in this
case.

Re: Branch folding. I like all optimizations. Give me more MIPS! More.
MORE. MORE!

    bcase

---------------
Back again by popular demand:

Sung to the tune of "Timewarp:"

    It's just a shift to the left,
    And then a shift to the ri-i-ight!
    Get a load of those MIPS
    And the code's real ti-i-ght!
    This lock-step pipeline
    Really drives me insa-a-a-e-a-ane
    Let's do the I-fetch again!