Path: utzoo!attcan!uunet!cs.utexas.edu!sdd.hp.com!usc!jarthur!nntp-server.caltech.edu!seismo.gps.caltech.edu!bruce
From: bruce@seismo.gps.caltech.edu (Bruce Worden)
Newsgroups: comp.lang.c
Subject: Re: A study in code optimization in C
Summary: Some statistics for various machines
Keywords: memcopy
Message-ID: <1990Jul28.203800.17258@laguna.ccsf.caltech.edu>
Date: 28 Jul 90 20:38:00 GMT
References: <133@smds.UUCP> <1990Jul26.144134.16053@ux1.cso.uiuc.edu> <1349@proto.COM>
Sender: bruce@seismo.gps.caltech.edu (Bruce Worden)
Organization: Seismological Laboratory, California Institute of Technology, CA
Lines: 100

In article <1349@proto.COM> joe@proto.COM (Joe Huffman) writes:
>In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
>> In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>> >
>> >The macro shown below is an optimized memory to memory copy macro.
>> >It is probably faster than memcopy on your machine -- I have checked
>> >it on several machines and have always found it to be faster.
>>                                  !!!!!!
>> Oh My!.
>> Time on my computer, in seconds, for 1000 copies of a 20 kilobyte array:
>>                           His code                   library memcpy       
>> Compiler 1:
>>                (chars)     12.6                            2.7
>>                (ints)       6.9                            2.7
>> Compiler 2:
>>                (chars)     23.6                            1.3
>>                (ints)       6.9                            1.3
>[Stuff deleted... compilers were Microsoft and Microway NDPC, machine was
>20 MHz 386]
>
>I just ran it on a 20 MHz 386 running SCO UNIX.  The timing were done with 
>5000 copies but then divided by 5 to make the numbers comparable.
>			   His code		       library memcpy
>SCO supplied MSC 5.1
>		(chars)	    14.0		             2.05
>Zortech
>	     386 code generator not available		     1.80

Here are the results on the machines I could find the other day.  The 
compilers are the native compilers unless otherwise stated, and I turned on 
whatever compiler optimizations I could.  As in the figures quoted above, 
times are in seconds for 1000 copies of a 20 kbyte array:

Sun Sparcstation 1+
        Him              memcpy
chars: 7.6               2.0
ints:  2.0               2.0

Sun 4/280
        Him              memcpy
chars: 9.8               2.8     
ints:  2.5               2.8

Sun Sparcstation SLC
        Him              memcpy    
chars: 9.9               2.6     
ints:  2.5               2.6

Sun 386i
        Him              memcpy
chars: 9.5               2.6     
ints:  2.4               2.6

Sun 3/160
        Him              memcpy
chars: 13.7              4.5     
ints:  3.4               4.5

Inmos T800 (Meiko, 25MHz, kind of unfair because of the block_copy instruction)
        Him              memcpy    
chars: 37.6              1.6     
ints:  8.4               1.6

i860 (Meiko, 40MHz, Green Hills C-I860 1.8.5, beta assembler 1.41, beta 
linker 1.2)
        Him              memcpy
chars: 2.1               3.9      
ints:  0.9               3.9

Convex C120 (Vector; yes, his code vectorizes nicely.  memcpy not available, 
so I used bcopy)
        Him              bcopy
chars: 3.0               1.0      
ints:  1.0               1.0

Convex C120 (Scalar; memcpy not available, so I used bcopy)
        Him              bcopy
chars: 28.4              1.5     
ints:  7.5               1.5

BBN TC2000 (Motorola 88000-based, Green Hills C-88000 2.35(1.8.4))
        Him              memcpy    
chars: 10.3              12.0     
ints:  4.9               12.0

In general, I'd say Richard's code does a pretty good job when moving ints,
and also on the newer machines (the BBN and the Meiko i860), where it beats
the library memcpy for both chars and ints.  In addition, his code is about
20% faster than a simple "for" loop on my Sparc 1+, so it illustrates a
useful principle as well.  I intend to use it in some selected applications.
Thanks for posting it.
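
For anyone who wants a baseline to compare against, the "chars" and "ints"
rows above correspond to moving the same buffer one char at a time or one int
at a time.  What follows is only a sketch of the plain "for" loop in both
forms.  It is not Richard's macro, and the names copy_chars and copy_ints are
made up for illustration:

```c
#include <stddef.h>

/* Baseline: copy n chars, one char per loop iteration. */
static void copy_chars(char *dst, const char *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Same loop, one int per iteration.  Here n counts ints, so the
   same buffer takes sizeof(int) times fewer iterations. */
static void copy_ints(int *dst, const int *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

For a buffer of nbytes bytes that is suitably aligned for int, the second
form would be called with n = nbytes / sizeof(int), which is the most likely
reason the "ints" rows come out several times faster than the "chars" rows.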

BIG TIME DISCLAIMER: I in no way intended this to be a comparison of 
different machines, but rather of the performance of one piece of C code on 
each of several different machines.  There are a lot of ways to do timings, 
and most of them aren't very good, so please don't flame me if I didn't do 
justice to some machine's absolute performance; it is the relative timings 
that matter.  If I screwed that up, flame away (though a nice note explaining 
the error might be more instructive).
						Bruce
P.S. For timing I used getusecclock() on the BBN, ticks() on the Meikos, and 
getrusage() on everything else.
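
For the getrusage() case, a timing loop might look roughly like the sketch
below.  This is not the program I ran; the helper names user_seconds and
time_memcpy are hypothetical, it assumes nbytes is greater than zero, and it
counts user CPU time only:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>

/* User CPU time consumed so far by this process, in seconds. */
static double user_seconds(void)
{
    struct rusage ru;

    getrusage(RUSAGE_SELF, &ru);
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6;
}

/* Time ncopies calls to memcpy of nbytes each (nbytes > 0 assumed).
   Returns elapsed user seconds, or -1.0 if allocation fails. */
static double time_memcpy(size_t nbytes, int ncopies)
{
    char *src = malloc(nbytes);
    char *dst = malloc(nbytes);
    volatile char sink = 0;
    double start, elapsed;
    int i;

    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 1, nbytes);

    start = user_seconds();
    for (i = 0; i < ncopies; i++) {
        src[0] = (char)i;           /* vary the input a little */
        memcpy(dst, src, nbytes);
        sink += dst[0];             /* use the result so the copy is less
                                       likely to be optimized away */
    }
    elapsed = user_seconds() - start;

    free(src);
    free(dst);
    return elapsed;
}
```

Calling time_memcpy(20 * 1024, 1000) roughly matches the setup used for the
tables.  Swapping the memcpy call for the routine under test gives the other
column, and it is the ratio of the two results, not either number alone, that
is worth reporting.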