Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!seismo!mcvax!jack
From: jack@mcvax.UUCP
Newsgroups: comp.arch,comp.lang.c
Subject: Re: String Processing Instruction
Message-ID: <7349@boring.mcvax.cwi.nl>
Date: Wed, 15-Apr-87 15:01:39 EST
Article-I.D.: boring.7349
Posted: Wed Apr 15 15:01:39 1987
Date-Received: Fri, 17-Apr-87 03:16:00 EST
References: <15292@amdcad.UUCP> <693@jenny.cl.cam.ac.uk>
Reply-To: jack@boring.UUCP (Jack Jansen)
Organization: AMOEBA project, CWI, Amsterdam
Lines: 63
Xref: utgpu comp.arch:883 comp.lang.c:1627

In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>You might be interested to know that such detection of null bytes in words
>can be done in 3 or 4 instructions on almost any hardware (nay even in C).
>(Code that follows relies on x being a 32 bit unsigned (or 2's complement
>int with overflow ignored)...)
>    #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>Then if e is an expression without side effects (e.g. variable)
>    has_nullbyte_(e)
>is nonzero iff the value of e has a null byte.

I was so impressed by this new trick (well, to *me* it is new :-) that I
immediately decided to try it. I coded strcpy() using this method on my
Whitechapel MG-1, a 32016 based machine, and the results were great:
break-even with the normal strcpy() at 4-char strings, slightly worse
with 5/6/7-char strings, and getting better and better from there on.
For strings of length 4N (N>=4) it ran twice as fast as the old strcpy().
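
For anyone who wants to convince themselves of the quoted test before
trusting it, here is a minimal sketch that is not part of the original
exchange. It assumes a compiler with <stdint.h>, and the names
HAS_NULLBYTE and has_nullbyte_slow are made up for illustration. The idea,
as far as I can tell: subtracting 1 from a byte sets its top bit only if
the byte was zero (a borrow), was above 0x80, or took a borrow from a zero
byte below it; the "& ~x" then throws out the bytes whose top bit was
already set, so the result is nonzero exactly when some byte is zero.

```c
#include <stdint.h>

/* Mycroft's test, spelled out for an explicit 32-bit unsigned type.
 * HAS_NULLBYTE is a hypothetical name, not the one used in the post. */
#define HAS_NULLBYTE(x) \
    ((((uint32_t)(x)) - 0x01010101u) & ~((uint32_t)(x)) & 0x80808080u)

/* Reference version to compare against: inspect each byte in turn. */
static int has_nullbyte_slow(uint32_t x)
{
    int i;

    for (i = 0; i < 4; i++) {
        if (((x >> (8 * i)) & 0xffu) == 0)
            return 1;   /* byte i is zero */
    }
    return 0;           /* all four bytes are nonzero */
}
```

Note that only the zero/nonzero outcome is reliable. Which flag bits come
out set can include bytes above the lowest zero byte, so the macro alone
does not say where the null byte sits.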

This is the routine:

    /* assumes a 32-bit long, as the quoted macro requires */
    #define hasnull(x) (((x)-0x01010101) & ~(x) & 0x80808080)

    long *
    strcpy(at,f)
    long *at;
    register long *f;
    {
        register long d;
        register long *t = at;
        register char *fc, *tc;

        do {
            d = *f++;
            if( !hasnull(d) ) {
                *t++ = d;           /* no null byte: copy the whole word */
                continue;
            }
            /* the terminator is in this word: finish byte by byte */
            tc = (char *)t;
            fc = (char *)(f-1);
            while( (*tc++ = *fc++) != 0 )
                ;
            return(at);
        } while(1);
    }

Coding in assembler caused a 30% decrease in time for small (10-char)
strings (fewer registers to save, t/tc and f/fc in the same reg, etc).

Something I haven't been able to explain yet is that unaligned strings
give the *same* performance. Maybe the extra fetches are noise wrt the
number of instruction fetches?

Note that the 32016 is a 32 bit machine with a 16 bit bus, so that is
probably why I found twice the speed, instead of four times.

Anyway, the next thing I thought of is "Wow! This is *great* for strcmp()
on big-endians. Comparing 4 bytes in one go through the loop!". But, of
course, I don't have a big-endian handy. Anyone care to try this?
--
    Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
    The shell is my oyster.