Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!snorkelwacker!paperboy!meissner
From: meissner@osf.org (Michael Meissner)
Newsgroups: comp.unix.wizards
Subject: Re: fastest way to copy hunks of memory
Message-ID: <MEISSNER.90May7112949@curley.osf.org>
Date: 7 May 90 15:29:49 GMT
References: <5531@helios.ee.lbl.gov> <1990May2.200732.11851@eci386.uucp>
	<1990May4.172145.4085@agate.berkeley.edu>
Sender: news@OSF.ORG
Organization: Open Software Foundation
Lines: 71
In-reply-to: c60c-3cf@e260-3f.berkeley.edu's message of 4 May 90 17:21:45 GMT

In article <1990May4.172145.4085@agate.berkeley.edu>
c60c-3cf@e260-3f.berkeley.edu (Dan Kogai) writes:

| In article <1990May2.200732.11851@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
| >Perhaps 
| >
| >    while(size--)
| >	*p1++ = *p2++;
| 
| or even
| 
| void *memcpy(void *to, void *from, size_t size){
| 	register int 	size_l = size / 4,	/* or (size >> log2(sizeof int)) */
| 					tail = size % 4;	/* or (size & log2(sizeof int)) */
| 	void			*result = to;				
| 	while(size_l--) (int *)to++ = (int *)from++;
| 	while(tail--) (char *)p1++ = (char *)p2++;
| 	return result;
| }
| 
| 	This should work almost 4 times as fast as incrementing
| by bytes--it uses the full length of the register.  The problem is that it
| doesn't work if either (void *to) or (void *from) is not aligned and the
| machine architecture doesn't allow unaligned assignment.  Such functions as
| memcpy() should be written in assembler, I think...

The above code will not work on machines with strict alignment
requirements (e.g., most RISC machines) if either the 'to' or 'from'
pointer is not aligned on entry, since the user could certainly do
something like:

	memcpy (to+1, from, size);
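
One way to guard against that case is to test the low-order bits of
both pointers before taking the word-at-a-time path.  The following is
only a sketch: the helper names are hypothetical, and it assumes a
compiler that provides uintptr_t in <stdint.h> and a word size that is
a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p is a multiple of align.
   Assumes align is a power of two. */
static int is_aligned(const void *p, size_t align)
{
	return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: int-sized copies are only safe on a
   strict-alignment machine when both pointers are int aligned. */
static int can_copy_by_words(const void *to, const void *from)
{
	return is_aligned(to, sizeof(int)) && is_aligned(from, sizeof(int));
}
```

With an int-aligned buffer buf, can_copy_by_words(buf, buf) holds, but
can_copy_by_words(buf + 1, buf) does not, which is exactly the
memcpy (to+1, from, size) case.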

It also will not work under ANSI C compilers, since the construction:

	(int *)to++ = ...

is illegal ANSI C: the result of a cast is not an lvalue, so it cannot
be assigned to, and the ++ applies to a void pointer, which has no
element size to step by.  Finally, to get the most performance on RISC
machines, you have to know about the underlying machine
characteristics.  For example, on the 88k, there is a 2 cycle delay
after a load instruction has been initiated before the loaded value is
available in a register (there are hardware interlocks, so even naive
code will work, just more slowly).  Thus on the 88k, after dealing
with any initially unaligned pointers, the main loop would issue both
loads before either store, and look like:

	...

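	{
		register int word1, word2, *word_to, *word_from;

		/* assumes both pointers are word aligned by now, and
		   that at least 2 * sizeof (int) bytes remain */
		word_to = (int *) to;
		word_from = (int *) from;

		do {
			/* issue both loads before either store */
			word1 = word_from[0];
			word2 = word_from[1];
			word_from += 2;
			size -= 2 * sizeof (int);
			word_to[0] = word1;
			word_to[1] = word2;
			word_to += 2;
		} while ( size >= 2 * sizeof(int) );
		/* fewer than 2 words are left for the byte loop */
	}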
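
For illustration, here is one way the pieces might fit together around
the loop above: a byte loop to reach alignment, the two-word loop, and
a byte loop for the tail.  It is a sketch under assumptions, not a
tuned routine.  The name word_copy is hypothetical (chosen so it does
not clash with the library memcpy), uintptr_t from <stdint.h> is
assumed, and it falls back to plain byte copying when the two pointers
are misaligned relative to each other.  Compilers that optimize on
strict aliasing rules may also need extra care with the int accesses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; a sketch of an aligned, unrolled copy. */
void *word_copy(void *to, const void *from, size_t size)
{
	unsigned char *t = to;
	const unsigned char *f = from;
	const size_t mask = sizeof(int) - 1;

	/* Word copies are only possible when both pointers have the
	   same offset within a word. */
	if ((((uintptr_t)t ^ (uintptr_t)f) & mask) == 0) {
		int *word_to;
		const int *word_from;

		/* Copy leading bytes until the pointers are aligned. */
		while (size > 0 && ((uintptr_t)t & mask) != 0) {
			*t++ = *f++;
			size--;
		}

		word_to = (int *)t;
		word_from = (const int *)f;

		/* Two loads, then two stores, as in the loop above. */
		while (size >= 2 * sizeof(int)) {
			int word1 = word_from[0];
			int word2 = word_from[1];
			word_from += 2;
			size -= 2 * sizeof(int);
			word_to[0] = word1;
			word_to[1] = word2;
			word_to += 2;
		}

		t = (unsigned char *)word_to;
		f = (const unsigned char *)word_from;
	}

	/* Tail bytes, or the whole copy if the offsets differ. */
	while (size > 0) {
		*t++ = *f++;
		size--;
	}

	return to;
}
```

Note that this still ignores overlapping buffers, which memmove has to
handle, and it does nothing clever when the source and destination
offsets differ.
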

Optimizing bcopy/memcpy/memmove is not as simple as it looks.  It
takes a lot of skull sweat, and worrying about unusual cases.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so