Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!seismo!mcvax!jack
From: jack@mcvax.UUCP
Newsgroups: comp.arch,comp.lang.c
Subject: Re: String Processing Instruction
Message-ID: <7349@boring.mcvax.cwi.nl>
Date: Wed, 15-Apr-87 15:01:39 EST
Article-I.D.: boring.7349
Posted: Wed Apr 15 15:01:39 1987
Date-Received: Fri, 17-Apr-87 03:16:00 EST
References: <15292@amdcad.UUCP> <693@jenny.cl.cam.ac.uk>
Reply-To: jack@boring.UUCP (Jack Jansen)
Organization: AMOEBA project, CWI, Amsterdam
Lines: 63
Xref: utgpu comp.arch:883 comp.lang.c:1627

In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>You might be interested to know that such detection of null bytes in words
>can be done in 3 or 4 instructions on almost any hardware (nay even in C).
>(Code that follows relies on x being a 32 bit unsigned (or 2's complement
>int with overflow ignored)...)
>   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>Then if e is an expression without side effects (e.g. variable)
>   has_nullbyte_(e)
>is nonzero iff the value of e has a null byte.
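
A minimal sketch of the same test written as a function may help show why
it works. The name has_null_byte and the use of uint32_t are assumptions
made for illustration; the quoted macro only asks that x be a 32-bit
unsigned value.

```c
#include <stdint.h>

/* Hypothetical helper: same expression as the quoted macro, on an exact
 * 32-bit unsigned type (uint32_t is an assumption, not from the post). */
int has_null_byte(uint32_t x)
{
    /* x - 0x01010101 subtracts 1 from every byte.  A byte that was 0x00
     * wraps around and ends up with its top bit set.  Bytes that were
     * 0x80 or above can also keep their top bit, so "& ~x" throws away
     * every byte whose top bit was already set in x.  The final mask
     * keeps only the four top bits. */
    uint32_t marks = (x - 0x01010101u) & ~x & 0x80808080u;

    /* A borrow out of a zero byte can also mark the byte above it, so the
     * result says *whether* a null byte exists, not exactly which bytes. */
    return marks != 0;
}
```

For instance, has_null_byte(0x41424300) is true, while 0x41424344,
0x80808080 and 0xFFFFFFFF all give false.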

I was so impressed by this new trick (well, to *me* it is new:-)
that I immediately decided to try it. On my Whitechapel MG-1,
a 32016 based machine, the results were impressive.

I coded strcpy() using this method, and the results were
great. Break-even with the normal strcpy() at 4-char strings, performance
slightly worse with 5/6/7-char strings, and getting better and
better from there on. For strings with length 4N (N>=4) performance
was twice that of the old strcpy(). This is the routine:


#define hasnull(x) (((x)-0x01010101) & ~(x) & 0x80808080)

strcpy(at,f)
    long *at;
    register long *f;
{
    register long d;
    register long *t = at;
    register char *fc, *tc;

    do {
	d = *f++;
	if( !hasnull(d) ) {
	    *t++ = d;
	    continue;
	}
	/* A null byte is somewhere in this word: finish byte by byte. */
	tc = (char *)t;
	fc = (char *)(f-1);
	while( *tc++ = *fc++);
	break;
    } while(1);
    return(at);
}

Recoding it in assembler cut the time by 30% for small (10-char)
strings (fewer registers to save, t/tc and f/fc in the same reg, etc).
Something I haven't explained yet is that unaligned strings give the
*same* performance. Maybe the extra fetches are noise wrt the
number of instruction fetches?

Note that the 32016 is a 32 bit machine with a 16 bit bus,
so that is probably why I found twice the speed, instead of four
times.

Anyway, the next thing I thought of is "Wow! This is *great* for
strcmp() on big-endians. Comparing 4 bytes in one go through
the loop!". But, of course, I don't have a big-endian handy.
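
A rough sketch of what such a strcmp() might look like, under some loud
assumptions: the names word_strcmp and load_be32 are made up for
illustration, uint32_t stands in for a 32-bit word, and both buffers must
be readable up to the next multiple of 4 bytes past the terminating null.
load_be32 builds the word by hand so the sketch runs on any host; on a real
big-endian machine it would be a plain aligned word load. It has not been
timed anywhere.

```c
#include <stdint.h>

#define hasnull(x) (((x) - 0x01010101u) & ~(x) & 0x80808080u)

/* Hypothetical helper: build a word with p[0] as the most significant
 * byte, i.e. what a big-endian machine gets from a single word load. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical word-at-a-time strcmp.
 * ASSUMPTION: both buffers are padded to a multiple of 4 bytes. */
int word_strcmp(const char *s, const char *t)
{
    const unsigned char *a = (const unsigned char *)s;
    const unsigned char *b = (const unsigned char *)t;

    for (;;) {
        uint32_t x = load_be32(a);
        uint32_t y = load_be32(b);

        /* Stop at the first word that differs or holds the terminator. */
        if (x != y || hasnull(x))
            break;
        a += 4;             /* four equal, non-null bytes in one go */
        b += 4;
    }

    /* Sort out the last word byte by byte. */
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (int)*a - (int)*b;
}
```

The final byte loop keeps the sketch simple. With big-endian words an
unsigned compare of x and y gives the same ordering as a bytewise compare,
so the tail could probably be shortened when no null byte comes before the
difference.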

Anyone care to try this?
-- 
	Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
	The shell is my oyster.