Xref: utzoo comp.lang.c:12317 comp.arch:6226
Path: utzoo!attcan!uunet!tektronix!orca!tekecs!frip!andrew
From: andrew@frip.gwd.tek.com (Andrew Klossner)
Newsgroups: comp.lang.c,comp.arch
Subject: Re: Explanation, please!
Message-ID: <10329@tekecs.TEK.COM>
Date: 1 Sep 88 18:55:47 GMT
References: <638@paris.ics.uci.edu> <189@bales.UUCP>
Sender: andrew@tekecs.TEK.COM
Organization: Tektronix, Wilsonville, Oregon
Lines: 37

Nathaniel Stitt writes:

	"Here is my own personal version of the "Portable Optimized
	Copy" routine. It certainly seems more clear than the above
	example, and I would expect it to be at least as fast on
	virtually any machine."

then goes on to present a routine which uses follow-on code to handle
the last few bytes after all octets have been copied.

It's cleaner code, but it won't be quite as fast on many systems with
instruction caches because it has fifteen byte-move instructions,
replacing eight in the original, so more time is spent loading the loop
into the i-cache. On systems with very small i-caches (my favorite
example is the IBM 360/91 with 16 bytes), the bigger loop may not all
fit into cache, and would be considerably slower.

Several contributors have suggested that unrolling a byte-copy loop is
a win. On some architectures it is, but on a good pipelined system it
may not be. As an example, the program fragment

	while (count--) {
		to[i] = from[i];
		++i;
	}

can be compiled to code on the M88k which copies memory as fast as a
DMA controller could; the instructions to decrement, increment, and
branch overlap with the data load/store requests. [If everything's in
registers, indexing in this case is actually faster than keeping
separate "to" and "from" pointers and incrementing both.]

This assumes that "to" and "from" are pointers-to-ints or
pointers-to-doubles. Copying less than a word at a time is slower.

  -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
                        (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]
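
For reference, the two loop shapes discussed in the post can be sketched in C roughly as follows. This is only an illustration, not code from the thread: the function names copy_indexed and copy_unrolled4 are hypothetical, the element type is assumed to be int (the post assumes pointers-to-ints or pointers-to-doubles), and the unroll factor of four is an arbitrary choice. Which version is faster depends on the machine, as the post argues, so no timing is claimed here.

```c
#include <stddef.h>

/* Hypothetical name: the simple indexed loop from the post,
 * wrapped in a function and copying one int per iteration. */
void copy_indexed(int *to, const int *from, size_t count)
{
    size_t i = 0;

    while (count--) {
        to[i] = from[i];
        ++i;
    }
}

/* Hypothetical name: an unrolled variant with follow-on code for the
 * leftover elements. The factor of four is an assumption made for this
 * sketch. The loop body is larger, which is the i-cache cost the post
 * describes. */
void copy_unrolled4(int *to, const int *from, size_t count)
{
    size_t i = 0;
    size_t blocks = count / 4;  /* full groups of four elements */
    size_t rest = count % 4;    /* elements left after the groups */

    while (blocks--) {
        to[i]     = from[i];
        to[i + 1] = from[i + 1];
        to[i + 2] = from[i + 2];
        to[i + 3] = from[i + 3];
        i += 4;
    }

    /* Follow-on code: copy the remaining zero to three elements. */
    while (rest--) {
        to[i] = from[i];
        ++i;
    }
}
```

Both functions copy the same elements in the same order; they differ only in how many loop iterations run and how much code the loop body occupies.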