Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!purdue!haven!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.lang.c
Subject: Re: faster bcopy using duffs device (source)
Keywords: loop unrolling, optimize, hacks
Message-ID: <19491@mimsy.UUCP>
Date: 9 Sep 89 05:02:12 GMT
References: <5180@portia.Stanford.EDU> <19473@mimsy.UUCP> <5603@thor.acc.stolaf.edu>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 106

>In article <19473@mimsy.UUCP> I suggest that
>>bcopy() should be written in assembly (on most processors), put in
>>a library, and forgotten about ....

In article <5603@thor.acc.stolaf.edu> mike@thor.acc.stolaf.edu (Mike Haertel)
writes:
>I just tried the obvious bcopy "while (n--) *s++ = *d++;"
>on a 68010 using gcc.  It produced a dbra loop that beat
>the sh*t out of the supposedly carefully handcoded one
>in the C library.  (Which is a Duffish sort of thing that
>tries to copy fullwords at a time.  Not Duff's device,
>but structurally similar.)

Alas, gcc is rather the exception, although this is changing (more
vendors are discovering gcc!), and it is disappointing how many
supposedly carefully handcoded simple-but-time-consuming routines
(bcopy, bzero, bcmp; or on SysV, memmove, memset, memcmp) were actually
thrown together by someone who had no idea what was optimal.

Just for fun, here is my bcopy-in-C for a 68010, assuming you have a
compiler that is as smart as gcc.  (In fact, it has been tested with
gcc, and, at least with the gcc on our Suns, compilers very nicely,
aside from one #ifdef.  Curiously, the gcc-specific hack is necessary
only for one of the two loops.)

/*
 * Copyright (c) 1989 by the University of Maryland Computer Science
 * Department.  All rights reserved.  Permission to copy for any
 * purpose is hereby granted so long as the copyright notice remains
 * intact.
 */

#ifdef test
typedef unsigned int size_t;
#else
#include <stddef.h>
#endif

/*
 * Block copy from src to dst, len bytes.
 * Allows misaligned copies (with attendant performance penalty)
 * and allows overlapping regions.
 *
 * We make the asumption that <read,write,read,write> cycles are
 * as fast as <read,read,write,write> cycles so that we can move
 * shortwords at a time.  If this is not true (e.g., if writes are
 * done `instantly' through a write buffer), this is suboptimal,
 * and the loop should be changed to move longwords at a time.
 */
void bcopy(register char *src, register char *dst, register size_t len) {
	register size_t l2;

	if (len == 0)
		return;
	/* check for potential overlap */
	if ((unsigned long)src + len > (unsigned long)dst) {
		/* copy backwards */
		src += len;
		dst += len;
		/* get operands word-aligned */
		if ((long)src & 1)
			*--dst = *--src;
		if ((long)dst & 1) {
			/* cannot word-align both operands */
			while (--len != (size_t)-1)
				*--dst = *--src;
			return;
		}
		/* copy words until no words left */
		l2 = len >> 1;
		while (--l2 != (size_t)-1) {
#ifndef __GNUC__
			/* for some reason, gcc does not optimise this */
			*(short *)dst = *(short *)src;
			dst -= sizeof (short);
			src -= sizeof (short);
#else
			/* so we use a gcc extension */
			*--(short *)dst = *--(short *)src;
#endif
		}
		/* copy last byte, if any */
		if (len & 1)
			*--dst = *--src;
	} else {
		/* copy forwards, otherwise identical */
		if ((long)src & 1)
			*dst++ = *src++;
		if ((long)dst & 1) {
			while (--len != (size_t)-1)
				*dst++ = *src++;
			return;
		}
		l2 = len >> 1;
		while (--l2 != (size_t)-1) {
			*(short *)dst = *(short *)src;
			dst += sizeof (short);
			src += sizeof (short);
		}
		if (len & 1)
			*dst = *src;
	}
}
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris