Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!lll-crg!lll-lcc!pyramid!decwrl!sun!guy
From: guy@sun.uucp (Guy Harris)
Newsgroups: net.lang.c
Subject: Re: Re: structure alignment question
Message-ID: <7503@sun.uucp>
Date: Mon, 22-Sep-86 16:07:55 EDT
Article-I.D.: sun.7503
Posted: Mon Sep 22 16:07:55 1986
Date-Received: Tue, 23-Sep-86 05:59:55 EDT
References: <101@hcx1.UUCP> <7363@sun.uucp> <696@mips.UUCP> <7447@sun.uucp> <1705@mcc-pp.UUCP> <3527@umcp-cs.UUCP>
Organization: Sun Microsystems, Inc.
Lines: 135

> (Does the 68020 really fault?  I thought it just did two bus accesses.)

"fault" was a poor choice of words on his part; it already means something,
namely a trap.  The '020 doesn't fault, it just does two bus accesses.

> 	| strcpy(to, from) char *to, *from; { *to = *from; return (to); }
> 	| /*UNTESTED!*/

When tested with some simple test cases (even->even, odd->even, even->odd,
odd->odd), it worked.

> 	TO	=	a0		| I think this works

It does.

> 	| I forget if this is legal.  If not, copy to d0 first.
> 		btst	#0,TO		| test for odd destination

It's not legal, so I modified it to copy to d0 first.

> I wonder, though, if this is truly faster.  Should not a movb/bnes
> pair run in loop mode?

Nope.  Only "dbCC" loops run in loop mode.

> (Perhaps not; `dbcc' loops do, though, and one could use a dbra surrounded
> by a bit of extra logic.)

Yes.  The following, courtesy of John Gilmore and Vaughan Pratt, is what is
actually used in the (3.2 version of) "strcpy", etc.:

	moveq	#-1,d1		| maximum possible (16-bit) count
hardloop:
	movb	FROM@+,TO@+	| copy...
	dbeq	d1,hardloop	| until we copy a null or the count is -1
	bne	hardloop	| if not-null, continue copying with count
				| freshly initialized to -1
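
For those who don't read 68K assembler, the control flow of that loop can be
rendered roughly in C as below.  This is only a sketch, not the Sun source:
the name "strcpy_dbeq_sketch" is made up for illustration, a uint16_t stands
in for the low word of d1, and it says nothing about how fast compiled C
would run.

```c
#include <stdint.h>

/* Hypothetical C rendering of the movb/dbeq/bne loop; not the Sun source. */
char *strcpy_dbeq_sketch(char *to, const char *from)
{
    char *ret = to;
    uint16_t count = 0xFFFF;    /* moveq #-1,d1: low word of d1 is all ones */
    char c;

    do {                        /* re-entered by "bne hardloop" */
        do {
            c = *from++;        /* movb FROM@+,TO@+ */
            *to++ = c;
            /* dbeq: stop if the byte was a null; otherwise decrement the
               count and stop once it has wrapped around to -1 (0xFFFF) */
        } while (c != '\0' && count-- != 0);
    } while (c != '\0');        /* count is 0xFFFF again here, as in the asm */

    return ret;
}
```

The inner loop moves at most 65536 bytes before it falls out; the outer test
sends it back in with the count already reset, which is all the "bne" is for.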

Now for the numbers.  A test program was built to do a large number of
copies in a loop, and to do the same loop with no body; the times were
subtracted and the result was divided by the number of iterations.  The
program was run with strings of length 2, 10, and 100.  All strings were
"malloc"ed, so they started on word boundaries (the program tested this,
just to make sure).  The results:

Byte-by-byte copy, using "movb"/"dbcc" loop (standard 3.2 "strcpy"), Sun-2
(10 MHz 68010, no caches, zero wait states):

	250000 copies of 2 bytes took 5.720000 seconds
	0.000023 seconds/call

	50000 copies of 10 bytes took 1.760000 seconds
	0.000035 seconds/call

	5000 copies of 100 bytes took 0.860000 seconds
	0.000172 seconds/call

New strcpy, same Sun-2:

	250000 copies of 2 bytes took 8.440000 seconds
	0.000034 seconds/call

	50000 copies of 10 bytes took 3.120000 seconds
	0.000062 seconds/call

	5000 copies of 100 bytes took 1.880000 seconds
	0.000376 seconds/call

Standard strcpy, Sun-3/75 (16.67 MHz 68020, no caches other than the on-chip
256-byte instruction cache, 1.5 wait states):

	250000 copies of 2 bytes took 1.780000 seconds
	0.000007 seconds/call

	50000 copies of 10 bytes took 0.720000 seconds
	0.000014 seconds/call

	5000 copies of 100 bytes took 0.500000 seconds
	0.000100 seconds/call

New strcpy, same Sun-3/75:

	250000 copies of 2 bytes took 2.800000 seconds
	0.000011 seconds/call

	50000 copies of 10 bytes took 0.960000 seconds
	0.000019 seconds/call

	5000 copies of 100 bytes took 0.520000 seconds
	0.000104 seconds/call

Standard strcpy, Sun-3/200 (25 MHz 68020, off-chip write-back cache, 0 wait
states):

	250000 copies of 2 bytes took 1.060000 seconds
	0.000004 seconds/call

	50000 copies of 10 bytes took 0.480000 seconds
	0.000010 seconds/call

	5000 copies of 100 bytes took 0.260000 seconds
	0.000052 seconds/call

New strcpy, same Sun-3/200:

	250000 copies of 2 bytes took 1.420000 seconds
	0.000006 seconds/call

	50000 copies of 10 bytes took 0.520000 seconds
	0.000010 seconds/call

	5000 copies of 100 bytes took 0.320000 seconds
	0.000064 seconds/call

These numbers were quite reproducible.
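
The measurement procedure described earlier (time the loop with the copy in
it, time the same loop with no body, subtract, divide by the iteration count)
can be sketched in C roughly as follows.  This is not the actual test
program: the names "seconds_per_call", "is_word_aligned", and "time_copies"
are invented here, clock() is assumed to be an adequate timer, and the
"volatile" qualifiers are there only to keep a modern optimizer from deleting
the loops.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Nonzero if p sits on an even (16-bit word) boundary. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 1u) == 0;
}

/* Subtract the empty-loop time, then divide by the number of iterations. */
static double seconds_per_call(double with_body, double empty, long iterations)
{
    return (with_body - empty) / (double)iterations;
}

/* Hypothetical harness: returns 0 and stores the per-call time on success,
   -1 if allocation fails or a buffer is not word-aligned. */
static int time_copies(size_t len, long iterations, double *per_call)
{
    char *from = malloc(len + 1);
    char *to = malloc(len + 1);
    char *(*volatile copy)(char *, const char *) = strcpy;
    volatile long i;
    clock_t t0, t1, t2;
    int ok = from && to && is_word_aligned(from) && is_word_aligned(to);

    if (ok) {
        memset(from, 'a', len);        /* build a string of the given length */
        from[len] = '\0';

        t0 = clock();
        for (i = 0; i < iterations; i++)
            copy(to, from);            /* loop with the copy as its body */
        t1 = clock();
        for (i = 0; i < iterations; i++)
            ;                          /* same loop, no body */
        t2 = clock();

        *per_call = seconds_per_call((double)(t1 - t0) / CLOCKS_PER_SEC,
                                     (double)(t2 - t1) / CLOCKS_PER_SEC,
                                     iterations);
    }
    free(from);
    free(to);
    return ok ? 0 : -1;
}
```

A call such as time_copies(10, 50000, &t) corresponds to the 10-byte runs
above; on a fast modern machine the iteration counts would have to be much
larger for clock() to resolve anything.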

The moral(s) of the story:

	1) Loop mode, on the 010, is a big win.  (The byte-by-byte
	   "strcpy" runs in loop mode on the 010, the other one
	   doesn't; the other one takes about twice as long.)

	2) The instruction cache, on the 020, is a big win.  (The 020
	   versions don't differ by as much, and the other one seems
	   to be catching up as the strings get longer, which didn't
	   happen on the 010.)

	3) With realistic string lengths, and 68K-family machines
	   offered by Sun, at least, the plain vanilla byte-by-byte
	   copy is the right way to do things, even with word-aligned
	   strings.
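
To put numbers on "about twice as long" and "catching up", the ratio of new
to standard time can be worked out from the totals listed above.  A small
sketch; the "struct row" layout and the "slowdown" helper are made up for
illustration, and the figures are just the ones already given.

```c
#include <stdio.h>

/* One benchmark line: total seconds for the standard and the new strcpy. */
struct row {
    const char *machine;
    int len;            /* string length in bytes */
    double standard;    /* seconds, byte-by-byte dbcc loop */
    double new_;        /* seconds, new strcpy */
};

static const struct row rows[] = {
    { "Sun-2",       2, 5.72, 8.44 },
    { "Sun-2",      10, 1.76, 3.12 },
    { "Sun-2",     100, 0.86, 1.88 },
    { "Sun-3/75",    2, 1.78, 2.80 },
    { "Sun-3/75",   10, 0.72, 0.96 },
    { "Sun-3/75",  100, 0.50, 0.52 },
    { "Sun-3/200",   2, 1.06, 1.42 },
    { "Sun-3/200",  10, 0.48, 0.52 },
    { "Sun-3/200", 100, 0.26, 0.32 },
};

/* How many times longer the new strcpy took than the standard one. */
static double slowdown(const struct row *r)
{
    return r->new_ / r->standard;
}

/* Print one ratio per benchmark line. */
static void print_slowdowns(void)
{
    size_t i;

    for (i = 0; i < sizeof rows / sizeof rows[0]; i++)
        printf("%-10s %3d bytes: new/standard = %.2f\n",
               rows[i].machine, rows[i].len, slowdown(&rows[i]));
}
```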
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)