Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!wuarchive!brutus.cs.uiuc.edu!apple!sun-barr!ames!amdcad!light!bvs
From: bvs@light.uucp (Bakul Shah)
Newsgroups: comp.lang.c
Subject: Re: faster bcopy using duffs device (source)
Keywords: hacks
Message-ID: <1989Sep8.224335.8825@light.uucp>
Date: 8 Sep 89 22:43:33 GMT
References: <5180@portia.Stanford.EDU> <19473@mimsy.UUCP>
Reply-To: bvs@light.UUCP (Bakul Shah)
Organization: -
Lines: 53

In article <19473@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <5180@portia.Stanford.EDU> stergios@Jessica.stanford.edu
>(stergios marinopoulos) writes:
>>I wanted a faster bcopy, so I used duffs device as a basis for it.
>
>bcopy() should be written in assembly (on most processors), put in
>a library, and forgotten about, because---for instance---a dbra loop
>beats a Duff loop on a 68010, every time.

A couple more points.

Even on a single processor, different trade-offs exist for different
amounts of copying (e.g. use of movem instructions on a 68000 for large
copies) or different alignments (e.g. word copies when src and dst are
word aligned, something else when they are not).  Vendors providing the
standard library should deal with such details.
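
To make the alignment trade-off concrete, here is a minimal sketch in
portable C.  The name copy_by_alignment and the 4-byte word size are
assumptions for illustration only; a real library routine would more
likely be hand-tuned assembly, as suggested above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only.  Copies count bytes from
 * src to dst; the regions must not overlap.  Returns dst. */
void *copy_by_alignment(void *dst, const void *src, size_t count)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Word copies only help when src and dst have the same offset
	 * within a 4-byte word; otherwise fall through to byte copies. */
	if ((((uintptr_t)d ^ (uintptr_t)s) & 3) == 0) {
		/* Byte-copy until both pointers reach a word boundary. */
		while (count > 0 && ((uintptr_t)d & 3) != 0) {
			*d++ = *s++;
			count--;
		}
		/* Move one 32-bit word per iteration.  The fixed-size
		 * memcpy calls stand in for a word load and store. */
		while (count >= 4) {
			uint32_t w;
			memcpy(&w, s, sizeof w);
			memcpy(d, &w, sizeof w);
			d += 4;
			s += 4;
			count -= 4;
		}
	}
	/* Remaining bytes: the tail, or everything if misaligned. */
	while (count-- > 0)
		*d++ = *s++;
	return dst;
}
```

The misaligned case is left as a plain byte loop here; that is the
"something else" a vendor would replace with shifts and merges or
whatever suits the machine.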

It is preferable to use standard functions whenever possible (memcpy
instead of bcopy), since an ANSI compiler knows what they do and can
optimize calls to them much better.  For instance, on a particular
machine a compiler may choose to inline the equivalent of

	/* Needs <stddef.h> for size_t and <stdint.h> for uintptr_t.
	 * BREAKEVENCOUNT, __alignedcpy and __unalignedcpy stand for
	 * whatever the implementation provides. */
	void *memcpy(void *dst, const void *src, size_t count)
	{
		/* copy right here for small counts */
		if (count < BREAKEVENCOUNT) {
			char *d = dst;
			const char *s = src;
			size_t c = count;
			while (c-- != 0)
				*d++ = *s++;
			return dst;
		}
		/* call a function depending on relative alignment */
		return
			(((uintptr_t)dst & 3) == ((uintptr_t)src & 3) ?
			 __alignedcpy : __unalignedcpy
			) (dst, src, count);
	}

Ain't that Disgusting!  Even more can be done if the count happens
to be a constant -- though this case can only be handled in the compiler
(as there is no preprocessor equivalent of #if defined(xxx) for
detecting constants).  Inlining is especially useful when small amounts
have to be copied.  Anyway, it is best to hide this in the compiler or
in string.h.
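
As a rough sketch of the constant-count case: some compilers expose the
test as an extension.  The example below assumes GCC or Clang and their
__builtin_constant_p, which is not part of ANSI C.  The names COPY and
small_copy and the threshold of 8 bytes are hypothetical, chosen only
for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical inline byte loop.  With a small constant n the compiler
 * is free to unroll it completely. */
static inline void *small_copy(void *dst, const void *src, size_t n)
{
	char *d = dst;
	const char *s = src;

	while (n-- != 0)
		*d++ = *s++;
	return dst;
}

/* Hypothetical macro, GCC/Clang only.  Uses the inline loop when the
 * count is a compile-time constant no larger than 8 (an arbitrary
 * threshold) and calls the library memcpy otherwise.  Note that n is
 * evaluated more than once, so it must not have side effects. */
#define COPY(dst, src, n) \
	(__builtin_constant_p(n) && (n) <= 8 \
		? small_copy((dst), (src), (n)) \
		: memcpy((dst), (src), (n)))
```

Something like COPY(buf, hdr, 4) can then turn into a handful of moves,
while COPY(buf, hdr, len) with a run-time len still goes through the
library routine.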

To give you a datapoint, on an AMD29000 such tricks cut time down from
about 10 cycles/byte in C code to under 0.7 cycles/byte for aligned src,
dst and 0.9 cycles/byte for unaligned src, dst (for copying about 100
bytes).  For very large copies it is possible to come within 5% of the
29k's limit of 0.5 cycles/byte -- assuming data memory can stream.
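
For a feel of what these rates mean, the arithmetic can be sketched as
below.  The function names are made up for illustration, and the inputs
are just the approximate figures quoted above, not new measurements.

```c
/* Estimated cycle count for copying `bytes` bytes at a given rate. */
double copy_cycles(unsigned bytes, double cycles_per_byte)
{
	return bytes * cycles_per_byte;
}

/* How many times faster the tuned rate is than the baseline rate. */
double speedup(double baseline_rate, double tuned_rate)
{
	return baseline_rate / tuned_rate;
}
```

With the numbers above, a 100-byte copy drops from roughly 1000 cycles
to about 70 (aligned) or 90 (unaligned), i.e. a speedup of roughly 14x
or 11x.  Being within 5% of the 0.5 cycles/byte limit means no more
than 0.525 cycles/byte.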

-- Bakul Shah <..!{ames,sun,ucbvax,uunet}!amdcad!light!bvs>