Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!snorkelwacker!paperboy!meissner
From: meissner@osf.org (Michael Meissner)
Newsgroups: comp.unix.wizards
Subject: Re: fastest way to copy hunks of memory
Message-ID:
Date: 7 May 90 15:29:49 GMT
References: <5531@helios.ee.lbl.gov> <1990May2.200732.11851@eci386.uucp> <1990May4.172145.4085@agate.berkeley.edu>
Sender: news@OSF.ORG
Organization: Open Software Foundation
Lines: 71
In-reply-to: c60c-3cf@e260-3f.berkeley.edu's message of 4 May 90 17:21:45 GMT

In article <1990May4.172145.4085@agate.berkeley.edu> c60c-3cf@e260-3f.berkeley.edu (Dan Kogai) writes:

| In article <1990May2.200732.11851@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
| >Perhaps
| >
| >    while(size--)
| >        *p1++ = *p2++;
|
| or even
|
| void *memcpy(void *to, void *from, size_t size){
|     register int size_l = size / 4, /* or (size >> log2(sizeof int)) */
|         tail = size % 4; /* or (size & log2(sizeof int)) */
|     void *result = to;
|     while(size_l--) (int *)to++ = (int *)from++;
|     while(tail--) (char *)p1++ = (char *)p2++;
|     return result;
| }
|
| This should work almost 4 times as fast compared to just incrementing
| by bytes--it uses full length of register.  The problem is that it doesn't
| work if either (void *to) and (void *from) is not aligned and the machine
| architecture doesn't allow unaligned assignment.  Such functions as
| memcpy() should be written in assembler, I think...

The above code will not work on machines with strict alignment
requirements (e.g., most RISC machines) if either the 'to' or the 'from'
pointer is not aligned on input, since the user could certainly do
something like:

    memcpy (to+1, from, size);

It also will not work under ANSI C compilers, since the construction:

    (int *)to++ = ...

is illegal ANSI C: the result of a cast is not an lvalue, so it cannot be
assigned to, and 'to' is a void pointer, which cannot be incremented.
Finally, to get the most performance on RISC machines, you have to know
about the underlying machine characteristics.
For example, on the 88k, there is a 2 cycle delay after a load instruction
has been initiated and before its result is available in a register (there
are hardware interlocks, so even naive code will work).  Thus on the 88k,
after dealing with any initial unaligned pointers and such, the main loop
would look like:

    ...
    {
      register int word1, word2, *word_to, *word_from;

      /* assumes both pointers are word aligned by this point and that
         at least 2 * sizeof (int) bytes remain on entry */
      word_to = (int *) to;
      word_from = (int *) from;
      do {
        /* start both loads before using either value, so that the
           load delay of one overlaps with the other */
        word1 = word_from[0];
        word2 = word_from[1];
        word_from += 2;
        size -= 2 * sizeof (int);
        word_to[0] = word1;
        word_to[1] = word2;
        word_to += 2;
      } while ( size > 2 * sizeof(int) );
    }

Optimizing bcopy/memcpy/memmove is not as simple as it looks.  It takes a
lot of skull sweat, and worrying about unusual cases.
--
Michael Meissner    email: meissner@osf.org    phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so
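
Editorial sketch: the three points raised in the post (alignment, the
illegal cast-as-lvalue, and issuing paired loads before the stores) can be
put together roughly as follows.  This is only an illustration, not the
author's routine and not a tuned library memcpy.  The name copy_words is
hypothetical, the use of uintptr_t assumes a C99 compiler, and reading the
buffers through int pointers assumes the compiler tolerates that aliasing
(a real library routine would arrange this itself).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies size bytes from 'from' to 'to'.
   Buffers must not overlap. */
void *copy_words(void *to, const void *from, size_t size)
{
    unsigned char *dst = to;
    const unsigned char *src = from;

    /* Word copies are only possible when both pointers sit at the same
       offset within a word; otherwise aligning one misaligns the other. */
    if ((uintptr_t)dst % sizeof(int) == (uintptr_t)src % sizeof(int)) {
        int *word_to;
        const int *word_from;

        /* Copy leading bytes until dst (and therefore src) is aligned. */
        while (size > 0 && (uintptr_t)dst % sizeof(int) != 0) {
            *dst++ = *src++;
            size--;
        }

        /* Typed pointers are set up once; no cast is used as an lvalue. */
        word_to = (int *)dst;
        word_from = (const int *)src;

        /* Both loads are written before either store, mirroring the
           88k loop in the post. */
        while (size >= 2 * sizeof(int)) {
            int word1 = word_from[0];
            int word2 = word_from[1];
            word_from += 2;
            size -= 2 * sizeof(int);
            word_to[0] = word1;
            word_to[1] = word2;
            word_to += 2;
        }

        dst = (unsigned char *)word_to;
        src = (const unsigned char *)word_from;
    }

    /* Tail bytes, or the whole buffer when the offsets differ. */
    while (size > 0) {
        *dst++ = *src++;
        size--;
    }
    return to;
}
```

A call such as copy_words (to+1, from, size) falls back to the byte loop
when the two offsets differ, which is slow but safe on strict-alignment
machines.  Whether the two-word unrolling pays off depends on the load
latency of the target, as the post points out for the 88k.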