Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ukma!husc6!xait!g-rh From: g-rh@XAIT.Xerox.COM (Richard Harter) Newsgroups: comp.lang.c Subject: Hand optimized copying Message-ID: <34421@XAIT.Xerox.COM> Date: 7 Oct 88 21:38:40 GMT Reply-To: g-rh@XAIT.Xerox.COM (Richard Harter) Organization: Xerox Corporation, Cambridge, Massachusetts Lines: 56 References: Several recent threads inspired me to take a close look at optimized copying. I tested a number of variations, doing both timing runs and inspection of the assembly language code produced. All tests were run on a Sun 3; results may be different on other machines. Here are my observations and results: Optimization: It is critical to have the optimizer turned on; the code generated for loop control is much tighter. Count loops: The following for a count loop (a loop which does something n times) is optimal: for (i = n+1; --i > 0;) {loop body} The >0 test does not cost anything extra. Alternative forms for the loop produce more expensive code. Switches and Duff's device: All variants using a switch were inferior. The problem with using a switch statement is that the initial code to check the range is a (relatively) expensive overhead. If one does use Duff's device the do-while loop conditional should be --count rather than count--. Functions versus macros: A function call adds a fixed overhead that dominates execution times for short copies. Loop unrolling: There is a gain of about 20-25% to be gained by loop unrolling. The best size appears to be 8; gains by using a larger factor are nominal, if any. Memcpy: Memcpy appears to use word copying if the source and destination fall on word boundaries and a tight byte copy otherwise. A tightly coded copy macro beats it if the source and destination are not on word boundaries. If you wish to avoid memcpy for portability reasons, it is profitable to use separate char, int, short, and long copy macros. Subscripts versus pointers: Pointers are superior. Best code: The following is a macro that appears to be optimal: #define copy(dest,src,nb) {register char *a, *b; register int i;\ a = dest; b = src;\ for (i = (nb %7)+1; --i;) *a++ = *b++;\ for (i = (nb>>3)+1; --i>0;) {\ *a++ = *b++; *a++ = *b++; *a++ = *b++; *a++ = *b++;\ *a++ = *b++; *a++ = *b++; *a++ = *b++; *a++ = *b++;\ }} Disclaimer 1: This code is not guaranteed to be free of typos. Disclaimer 2: Results are for the Sun 3. Disclaimer 3: There may tricks that I haven't thought of. -- In the fields of Hell where the grass grows high Are the graves of dreams allowed to die. Richard Harter, SMDS Inc.