Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!seismo!ll-xn!ames!amdcad!bcase
From: bcase@amdcad.UUCP
Newsgroups: comp.arch,comp.lang.c
Subject: Re: String Processing Instruction
Message-ID: <15313@amdcad.UUCP>
Date: Fri, 27-Mar-87 18:01:01 EST
Article-I.D.: amdcad.15313
Posted: Fri Mar 27 18:01:01 1987
Date-Received: Sat, 28-Mar-87 16:35:53 EST
References: <15292@amdcad.UUCP> <1001@ames.UUCP>
Reply-To: bcase@amdcad.UUCP (Brian Case)
Organization: Advanced Micro Devices, Sunnyvale, California
Lines: 68
Xref: utgpu comp.arch:684 comp.lang.c:1358

In article <1001@ames.UUCP> jaw@ames.UUCP (James A. Woods) writes:
>just curious which unix utilities use str(cpy|cmp|len) in their inner loops?
>certainly, 'vn' comes to mind as devoting much cpu time to these functions.
>
>is the 15-20% claimed due at all to either
>
>	(a) significantly slow byte addressing to begin with (ala cray)?
>	(b) in-line compilation of the string(3) stuff into the application?
>
>if (a), then improving memory byte access speed in the architecture is a
>more general solution with more payoff overall than the compare gate hack.
>what is the risc chip cost for byte vs. word addressibility, anyway?
>
>if (b), then maybe function call speed is the culprit rather than dearth of
>the specialized instruction.
>
>at any rate,
>for cray unix, buffer copy ops in the kernel were vastly improved when
>re-written for words instead of bytes, even more so when vectorized
>(the only place in the kernel with vectorization, i think).
>
>of course, table lookup using only 2^16 locations would be a joke
>software solution for super-intensive null-char-in-16-bit-smallword
>compare code.  drastic, but saves a test the amd chip appears worried about.
>personally, i'm a fan of branch folding ...
>

Re:  (a) above.  NO.  Implementing a byte addressable (especially writable)
memory slows down all memory accesses for a slight improvement in byte
processing efficiency.  For my own part, I can say that from the beginning
of the Am29000 project, I was firmly against anything but a word-oriented
interface.  You have to realize that byte accessing requires an alignment
network somewhere:  it will add some nanoseconds to all memory accesses; you
can put the alignment network in its own pipeline stage, but even then, it
will *always* slow down every memory access; there is nothing you can do about
it!  (The same reasoning leads to our addressing-mode-less load and store
instructions:  if addressing modes are there, then the instructions always
pay the cost even when the compiler knows better!)  Thus, this is not strictly
a RISC issue, but a maximum throughput issue regardless of RISCness.  (Note
that the Am29000 extract-byte and insert-byte instructions are essentially
putting the alignment network in a pipeline stage, but the compiler can
decide to pay or not pay the penalty for a given memory access (depending
upon whether it is a byte access or not)).
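
To make the trade-off concrete, here is a rough C sketch of byte access on
top of a word-only memory: one word load (or load/store pair) plus an
explicit extract or insert step.  The function names, the 32-bit word size,
and the numbering of byte 0 as the most significant byte are assumptions
made for illustration; they are not the Am29000 instruction definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers. Assumption: 32-bit words, and byte 0 is the
   most significant byte of the word. */
static uint32_t extract_byte(uint32_t word, unsigned idx)
{
    assert(idx < 4);
    return (word >> (8u * (3u - idx))) & 0xFFu;
}

static uint32_t insert_byte(uint32_t word, unsigned idx, uint32_t byte)
{
    unsigned shift;

    assert(idx < 4);
    shift = 8u * (3u - idx);
    /* Clear the target byte lane, then OR in the new byte. */
    return (word & ~((uint32_t)0xFFu << shift)) | ((byte & 0xFFu) << shift);
}

/* Byte load from word-only memory: one word load plus one extract. */
static uint32_t load_byte(const uint32_t *mem, uint32_t byte_addr)
{
    return extract_byte(mem[byte_addr >> 2], byte_addr & 3u);
}

/* Byte store: read the word, insert the byte, write the word back. */
static void store_byte(uint32_t *mem, uint32_t byte_addr, uint32_t byte)
{
    uint32_t *w = &mem[byte_addr >> 2];

    *w = insert_byte(*w, byte_addr & 3u, byte);
}
```

Only code that really wants a byte goes through load_byte() or
store_byte(); plain word accesses skip the shift-and-mask step entirely,
which is the choice described above as being left to the compiler.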

Re:  (b) above.  NO.  We do not do inlining of the strcmp(), strcpy(), or
strlen() routines (I wish we could, it would be even better!).  The Am29000
has one of the fastest calling conventions around.  The performance
improvement we saw with the special instruction is an order of magnitude
(ok, I'm taking a guess here, but it is probably pretty close) greater than
what would have been gained by in-lining in this case.
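
For readers who have not seen the idea, the following C sketch shows the
sort of "is any byte of this word zero?" test under discussion, done in
software with a well-known bit trick, and a strlen() built on it.  This is
only a stand-in under stated assumptions (32-bit words; the names
has_zero_byte and word_strlen are made up here); it is not the Am29000
instruction or our library code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero iff at least one of the four bytes of w is zero. */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Hypothetical word-at-a-time strlen. It reads whole aligned words, so
   it may touch bytes after the terminator inside the same word; that is
   tolerated on typical hardware but is outside what ISO C guarantees. */
static size_t word_strlen(const char *s)
{
    const char *p = s;
    uint32_t w;

    assert(s != NULL);

    /* Byte loop until p is word aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Word loop: one test per four characters. */
    for (;;) {
        memcpy(&w, p, sizeof w);
        if (has_zero_byte(w))
            break;
        p += 4;
    }

    /* Locate the terminator inside the final word. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The word loop performs one test per four characters instead of one per
character, which is where a single-instruction version of the test would
save time, independent of how cheap the call to strlen() itself is.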

Re:  Branch folding.  I like all optimizations.  Give me more MIPS!  More.
MORE.  MORE!

    bcase
---------------

Back again by popular demand:

Sung to the tune of "Timewarp:"

	It's just a shift to the left,
	And then a shift to the ri-i-ight!
	Get a load of those MIPS
	And the code's real ti-i-ght!
	This lock-step pipeline
	Really drives me insa-a-a-e-a-ane
	Let's do the I-fetch again!