Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!purdue!haven!mimsy!chris From: chris@mimsy.UUCP (Chris Torek) Newsgroups: comp.lang.c Subject: Re: faster bcopy using duffs device (source) Keywords: loop unrolling, optimize, hacks Message-ID: <19491@mimsy.UUCP> Date: 9 Sep 89 05:02:12 GMT References: <5180@portia.Stanford.EDU> <19473@mimsy.UUCP> <5603@thor.acc.stolaf.edu> Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742 Lines: 106 >In article <19473@mimsy.UUCP> I suggest that >>bcopy() should be written in assembly (on most processors), put in >>a library, and forgotten about .... In article <5603@thor.acc.stolaf.edu> mike@thor.acc.stolaf.edu (Mike Haertel) writes: >I just tried the obvious bcopy "while (n--) *s++ = *d++;" >on a 68010 using gcc. It produced a dbra loop that beat >the sh*t out of the supposedly carefully handcoded one >in the C library. (Which is a Duffish sort of thing that >tries to copy fullwords at a time. Not Duff's device, >but structurally similar.) Alas, gcc is rather the exception, although this is changing (more vendors are discovering gcc!), and it is disappointing how many supposedly carefully handcoded simple-but-time-consuming routines (bcopy, bzero, bcmp; or on SysV, memmove, memset, memcmp) were actually thrown together by someone who had no idea what was optimal. Just for fun, here is my bcopy-in-C for a 68010, assuming you have a compiler that is as smart as gcc. (In fact, it has been tested with gcc, and, at least with the gcc on our Suns, compilers very nicely, aside from one #ifdef. Curiously, the gcc-specific hack is necessary only for one of the two loops.) /* * Copyright (c) 1989 by the University of Maryland Computer Science * Department. All rights reserved. Permission to copy for any * purpose is hereby granted so long as the copyright notice remains * intact. */ #ifdef test typedef unsigned int size_t; #else #include #endif /* * Block copy from src to dst, len bytes. * Allows misaligned copies (with attendant performance penalty) * and allows overlapping regions. * * We make the asumption that cycles are * as fast as cycles so that we can move * shortwords at a time. If this is not true (e.g., if writes are * done `instantly' through a write buffer), this is suboptimal, * and the loop should be changed to move longwords at a time. */ void bcopy(register char *src, register char *dst, register size_t len) { register size_t l2; if (len == 0) return; /* check for potential overlap */ if ((unsigned long)src + len > (unsigned long)dst) { /* copy backwards */ src += len; dst += len; /* get operands word-aligned */ if ((long)src & 1) *--dst = *--src; if ((long)dst & 1) { /* cannot word-align both operands */ while (--len != (size_t)-1) *--dst = *--src; return; } /* copy words until no words left */ l2 = len >> 1; while (--l2 != (size_t)-1) { #ifndef __GNUC__ /* for some reason, gcc does not optimise this */ *(short *)dst = *(short *)src; dst -= sizeof (short); src -= sizeof (short); #else /* so we use a gcc extension */ *--(short *)dst = *--(short *)src; #endif } /* copy last byte, if any */ if (len & 1) *--dst = *--src; } else { /* copy forwards, otherwise identical */ if ((long)src & 1) *dst++ = *src++; if ((long)dst & 1) { while (--len != (size_t)-1) *dst++ = *src++; return; } l2 = len >> 1; while (--l2 != (size_t)-1) { *(short *)dst = *(short *)src; dst += sizeof (short); src += sizeof (short); } if (len & 1) *dst = *src; } } -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris