Xref: utzoo comp.lang.c:12317 comp.arch:6226
Path: utzoo!attcan!uunet!tektronix!orca!tekecs!frip!andrew
From: andrew@frip.gwd.tek.com (Andrew Klossner)
Newsgroups: comp.lang.c,comp.arch
Subject: Re: Explanation, please!
Message-ID: <10329@tekecs.TEK.COM>
Date: 1 Sep 88 18:55:47 GMT
References: <638@paris.ics.uci.edu> <dpmuY#2EBC4R=eric@snark.UUCP> <189@bales.UUCP>
Sender: andrew@tekecs.TEK.COM
Organization: Tektronix, Wilsonville, Oregon
Lines: 37

Nathaniel Stitt writes:

	"Here is my own personal version of the 'Portable Optimized
	Copy" routine.  It certainly seems more clear than the above
	example, and I would expect it to be at least as fast on
	virtually any machine."

then goes on to present a routine which uses follow-on code to handle
the last few bytes after all full groups of eight bytes have been
copied.  It's cleaner code, but it won't be quite as fast on many
systems with instruction caches: it has fifteen byte-move instructions
where the original has eight, so more time is spent loading the code
into the i-cache.  On systems with very small i-caches (my favorite
example is the IBM 360/91 with 16 bytes), the bigger loop may not all
fit into cache, and would be considerably slower.
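
For readers without the earlier article, here is a rough sketch of the
shape being described.  It is a reconstruction for illustration only,
not Stitt's posted code; the name copy_with_tail and its signature are
assumptions.  Eight byte moves sit in the loop and up to seven more
follow it, which is where the count of fifteen comes from.

```c
#include <stddef.h>

/* Hypothetical reconstruction, not the routine from the quoted article:
 * copy full groups of eight bytes in an unrolled loop, then finish the
 * remaining 0..7 bytes with follow-on code. */
void copy_with_tail(char *to, const char *from, size_t count)
{
	size_t groups = count / 8;	/* number of full groups of eight */

	while (groups--) {		/* eight byte moves per iteration */
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
	}

	switch (count % 8) {		/* follow-on code: up to seven moves */
	case 7: *to++ = *from++;	/* FALLTHROUGH */
	case 6: *to++ = *from++;	/* FALLTHROUGH */
	case 5: *to++ = *from++;	/* FALLTHROUGH */
	case 4: *to++ = *from++;	/* FALLTHROUGH */
	case 3: *to++ = *from++;	/* FALLTHROUGH */
	case 2: *to++ = *from++;	/* FALLTHROUGH */
	case 1: *to++ = *from++;
	case 0: break;
	}
}
```

Whether the extra seven moves cost anything measurable depends on the
instruction cache in question; the sketch only shows where they live.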

Several contributors have suggested that unrolling a byte-copy loop is
a win.  On some architectures it is, but on a good pipelined system it
may not be.  As an example, the program fragment

	while (count--) {
		to[i] = from[i];
		++i;
	}

can be compiled on the M88k to code which copies memory as fast as a
DMA controller could; the instructions to decrement, increment, and
branch overlap with the data load/store requests.

[If everything's in registers, indexing in this case is actually faster
than keeping separate "to" and "from" pointers and incrementing both.]

The DMA-speed figure assumes that "to" and "from" are pointers-to-ints
or pointers-to-doubles.  Copying less than a word at a time is slower.
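
To make the comparison concrete, here is a minimal sketch of the two
loop forms, assuming word-sized (int) elements.  The function names
copy_words and copy_words_ptr are made up for illustration, and which
form a given compiler schedules better is something to measure, not
something this sketch shows.

```c
#include <stddef.h>

/* Indexed form: one shared index "i" addresses both arrays.
 * copy_words is a hypothetical name wrapping the fragment above. */
void copy_words(int *to, const int *from, size_t count)
{
	size_t i = 0;

	while (count--) {
		to[i] = from[i];	/* one word per iteration */
		++i;
	}
}

/* Two-pointer form, for comparison: both pointers are incremented
 * on every iteration instead of a single index. */
void copy_words_ptr(int *to, const int *from, size_t count)
{
	while (count--)
		*to++ = *from++;
}
```

Both functions copy "count" ints and give the same result; they differ
only in how the addresses are formed.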

  -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
                        (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]