Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!wuarchive!brutus.cs.uiuc.edu!apple!sun-barr!ames!amdcad!light!bvs
From: bvs@light.uucp (Bakul Shah)
Newsgroups: comp.lang.c
Subject: Re: faster bcopy using duffs device (source)
Keywords: hacks
Message-ID: <1989Sep8.224335.8825@light.uucp>
Date: 8 Sep 89 22:43:33 GMT
References: <5180@portia.Stanford.EDU> <19473@mimsy.UUCP>
Reply-To: bvs@light.UUCP (Bakul Shah)
Organization: -
Lines: 53

In article <19473@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <5180@portia.Stanford.EDU> stergios@Jessica.stanford.edu
>(stergios marinopoulos) writes:
>>I wanted a faster bcopy, so I used duffs device as a basis for it.
>
>bcopy() should be written in assembly (on most processors), put in
>a library, and forgotten about, because---for instance---a dbra loop
>beats a Duff loop on a 68010, every time.

A couple more points.

Even on a single processor, different trade-offs exist for different
amounts of copying (e.g. use of movems on a 68000 for large copies) or
different alignments (e.g. word copies when src and dst are word
aligned, something else when they are not). It is the vendors
providing the standard library who should mess with such details.

It is preferable to use standard functions whenever possible (memcpy
instead of bcopy), since ANSI compilers can optimize them much better.
For instance, on a particular machine a compiler may choose to inline
something like

    void *
    memcpy(void * dst, void * src, unsigned count)
    {
        /* copy right here for small counts */
        if (count < BREAKEVENCOUNT) {
            char * d = (char *)dst;
            char * s = (char *)src;
            unsigned c = count;
            while (c-- != 0)
                *d++ = *s++;
            return dst;
        }
        /* call a function depending on relative alignment */
        return (((unsigned)dst&3) == ((unsigned)src&3)
                ? __alignedcpy
                : __unalignedcpy
               ) (dst, src, count);
    }

Ain't that Disgusting! Even more can be done if the count happens to
be a constant, though this case can only be handled in a compiler (as
there is no preprocessor equiv. of #if defined(xxx) for detecting
constants). Inlining is especially useful when small amounts have to
be copied. Anyway, it is best to hide this in a compiler or in
<string.h>.

To give you a data point, on an AMD29000 such tricks cut the time down
from about 10 cycles/byte in C code to under 0.7 cycles/byte for
aligned src, dst and 0.9 cycles/byte for unaligned src, dst (for
copying about 100 bytes). For very large copies it is possible to
come within 5% of the 29k's limit of 0.5 cycles/byte, assuming data
memory can stream.

-- Bakul Shah <..!{ames,sun,ucbvax,uunet}!amdcad!light!bvs>
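
For anyone who wants to try the dispatch outside a compiler, here is a
minimal standalone sketch in ANSI-style C. It is not the 29k code
behind the figures above. The names sketch_copy, word_copy and
byte_copy are made up for illustration, and BREAKEVEN_COUNT is an
arbitrary guess, since the post gives no value for it. The word moves
go through a fixed-size memcpy, which compilers typically lower to a
single load and store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical threshold: the right value is machine dependent. */
#define BREAKEVEN_COUNT 16u

/* Plain byte loop: used for small counts and for mismatched alignment. */
static void *byte_copy(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (count-- != 0)
        *d++ = *s++;
    return dst;
}

/* Stand-in for an aligned copy routine.
 * Assumes dst and src have the same offset modulo 4. */
static void *word_copy(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Leading bytes, until both pointers sit on a 4-byte boundary. */
    while (count != 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        count--;
    }
    /* Whole 32-bit words. */
    while (count >= sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
        count -= sizeof w;
    }
    /* Trailing bytes. */
    while (count-- != 0)
        *d++ = *s++;
    return dst;
}

/* Small counts are copied by the byte loop; larger ones are dispatched
 * on the relative alignment of the two pointers. */
void *sketch_copy(void *dst, const void *src, size_t count)
{
    if (count < BREAKEVEN_COUNT)
        return byte_copy(dst, src, count);
    return (((uintptr_t)dst & 3u) == ((uintptr_t)src & 3u)
                ? word_copy
                : byte_copy)(dst, src, count);
}
```

Like memcpy, sketch_copy assumes the two regions do not overlap. A
real library would replace byte_copy on the unaligned path with
something smarter (shift-and-merge, for example), which this sketch
does not attempt.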