Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!ames!amdcad!bcase
From: bcase@amdcad.UUCP
Newsgroups: comp.arch,comp.lang.c
Subject: String Processing Instruction
Message-ID: <15292@amdcad.UUCP>
Date: Wed, 25-Mar-87 14:13:49 EST
Article-I.D.: amdcad.15292
Posted: Wed Mar 25 14:13:49 1987
Date-Received: Fri, 27-Mar-87 04:24:48 EST
Organization: AMDCAD, Sunnyvale, CA
Lines: 58
Xref: utgpu comp.arch:661 comp.lang.c:1325

There was a discussion a few months ago about processing strings more
efficiently than a byte at a time.  The Am29000 takes one of the possible
approaches to improving string processing performance....


One unique feature of the Am29000 architecture is a special instruction.
This instruction is intended to speed up string processing, but my guess is
that other uses will be discovered.  The instruction is called "compare-bytes"
and works like this:

Compare bytes specifies two source register operands and one destination
register operand.  The 4 pairs of corresponding bytes of the two 32-bit
source operands are compared for equality (i.e., the two most-significant
bytes are compared, the two next-most-significant bytes are compared, etc.).
If any of the four pairs are equal, then the destination register is set
to the value "TRUE" (which on the Am29000 is a one in the most-significant
bit with all other bits cleared to zero).  If none of the four pairs are
equal, then the destination register is set to "FALSE" (all bits cleared).
(Am29000 conditional branch instructions test only the most-significant bit of
a register; condition codes are not used, so we get a free "test for negative.")
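
For readers who want to experiment without the hardware, here is a rough
software model of the semantics just described.  It is only a sketch: the
function name compare_bytes and the two macros are made up for illustration
and are not Am29000 assembler mnemonics.

```c
#include <stdint.h>

/* Boolean encoding described above: TRUE is a one in the most-significant
   bit with all other bits zero; FALSE is all zeros. */
#define AM29K_TRUE  0x80000000u
#define AM29K_FALSE 0x00000000u

/* Hypothetical C model of compare-bytes (illustrative name).
   Returns AM29K_TRUE if any of the four corresponding byte pairs of
   a and b are equal, AM29K_FALSE otherwise. */
static uint32_t compare_bytes(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;  /* a byte of diff is zero iff that pair matches */

    for (int shift = 0; shift < 32; shift += 8) {
        if (((diff >> shift) & 0xFFu) == 0)
            return AM29K_TRUE;
    }
    return AM29K_FALSE;
}
```

Because only the most-significant bit is ever set, a caller can treat the
result as a signed value and test it for negative, which mirrors what the
conditional branch does.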

So, suppose one source operand is all zeros (four null characters), which can
be specified in the instruction itself by choosing the zero-extended eight-bit
constant zero as the second operand, and the other operand is a word of the
character string being dealt with (say, for copying or comparing).  The
Am29000 can then determine in one cycle (not counting the branch) whether the
word contains the end-of-string character (according to the C language
definition of a string).  If it does not, the four bytes in the word can be
manipulated (e.g., loaded or stored) as a unit.  Word operations on the
Am29000 are much more efficient than character operations (as is true of most
machines).
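
The word-at-a-time loop described above might look roughly like this in C.
This is a sketch, not the actual library code: word_strcpy and
word_has_null are invented names, word_has_null stands in for
compare-bytes against the constant zero, and both buffers are assumed to be
word-aligned and padded out to a whole number of words.

```c
#include <stdint.h>

/* Stand-in for "compare-bytes with zero" (hypothetical name):
   nonzero if any of the four bytes of w is a null character. */
static int word_has_null(uint32_t w)
{
    for (int shift = 0; shift < 32; shift += 8) {
        if (((w >> shift) & 0xFFu) == 0)
            return 1;
    }
    return 0;
}

/* Copy a C string a word at a time.  ASSUMPTION: src and dst are
   word-aligned and their storage is a multiple of four bytes. */
static void word_strcpy(uint32_t *dst, const uint32_t *src)
{
    /* Fast path: whole words that contain no terminator. */
    while (!word_has_null(*src))
        *dst++ = *src++;

    /* Slow path: the last word holds the terminator, so finish by bytes. */
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while ((*d++ = *s++) != 0)
        ;
}
```

The byte loop at the end runs at most four times, so nearly all of the work
is done by the word loop.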

There are, of course, special circumstances to deal with (such as misaligned
strings, and we have a funnel shifter to help in those cases), but by using
the compare-bytes instruction in the library routines strcpy() and strcmp()
(and strlen() too, but we haven't bothered since it seems to never be used in
the programs we have encountered), significant improvements in the run-time
of many C programs can be realized.  Another thing which really helps is to
have the compiler word-align literal strings (and I have implemented this),
but even with word-alignment, some substrings will begin on strange boundaries
and must be dealt with correctly.
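
One simple way to cope with a string that starts on an odd boundary is to
step by bytes until the pointer is word-aligned and only then switch to
words.  The sketch below does this for strlen().  It is an illustration
only: sketch_strlen and word_has_null are invented names, it does not
model the funnel shifter, and it assumes the storage holding the string
extends to the next four-byte boundary so that the aligned word loads stay
inside the buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for "compare-bytes with zero" (hypothetical name). */
static int word_has_null(uint32_t w)
{
    for (int shift = 0; shift < 32; shift += 8) {
        if (((w >> shift) & 0xFFu) == 0)
            return 1;
    }
    return 0;
}

/* ASSUMPTION: the buffer containing s is padded to a four-byte boundary. */
static size_t sketch_strlen(const char *s)
{
    const char *p = s;

    /* Prologue: advance by bytes until p is word-aligned. */
    while (((uintptr_t)p % sizeof(uint32_t)) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Main loop: skip whole words that contain no terminator. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* aligned word load */
        if (word_has_null(w))
            break;
        p += sizeof w;
    }

    /* Epilogue: locate the terminator inside the final word. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```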

My approach to using this instruction consisted of re-writing the library
routines in C with function calls wherever the compare-bytes instruction
should go.  I compiled this C code with my compiler, changed the assembly code
to eliminate the function calls in favor of the compare-bytes instruction, and
assembled it into the library (actually a module of code that gets included in
all final links, but that is just a detail of our simple environment).  Since
most C programs (especially utilities and other systems programs) do a lot of
string processing, this one instruction is really worth the small
implementation cost.  It often improves run times by 15% to 20% (which just goes
to show that the impact of processing C language strings has long been ignored).
It implements just the right semantics and probably has other applications for
specialized pattern matching.  
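
To give a flavor of the "C with a function call where the instruction should
go" approach, here is a hedged sketch of a word-wise strcmp().  The names
word_strcmp and cpbyte_zero are hypothetical; cpbyte_zero is the
placeholder call that would be replaced by a single compare-bytes instruction
in the assembly output.  Both strings are assumed to be word-aligned and
zero-padded to a whole number of words.

```c
#include <stdint.h>

/* Placeholder for compare-bytes against zero (hypothetical name);
   in the hand-edited assembly this call becomes one instruction. */
static int cpbyte_zero(uint32_t w)
{
    for (int shift = 0; shift < 32; shift += 8) {
        if (((w >> shift) & 0xFFu) == 0)
            return 1;
    }
    return 0;
}

/* ASSUMPTION: a and b are word-aligned, zero-padded buffers. */
static int word_strcmp(const uint32_t *a, const uint32_t *b)
{
    /* Advance while the words match and neither holds a terminator
       (the words are equal, so testing one of them is enough). */
    while (*a == *b && !cpbyte_zero(*a)) {
        a++;
        b++;
    }

    /* The current words differ or contain the terminator, so the
       answer lies within these four bytes. */
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    while (*p != 0 && *p == *q) {
        p++;
        q++;
    }
    return (int)*p - (int)*q;
}
```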

I just thought some of you would be interested.

   bcase