Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!lll-crg!lll-lcc!pyramid!decwrl!sun!guy
From: guy@sun.uucp (Guy Harris)
Newsgroups: net.lang.c
Subject: Re: Re: structure alignment question
Message-ID: <7503@sun.uucp>
Date: Mon, 22-Sep-86 16:07:55 EDT
Article-I.D.: sun.7503
Posted: Mon Sep 22 16:07:55 1986
Date-Received: Tue, 23-Sep-86 05:59:55 EDT
References: <101@hcx1.UUCP> <7363@sun.uucp> <696@mips.UUCP> <7447@sun.uucp> <1705@mcc-pp.UUCP> <3527@umcp-cs.UUCP>
Organization: Sun Microsystems, Inc.
Lines: 135

> (Does the 68020 really fault?  I thought it just did two bus accesses.)

"fault" was a poor choice of words on his part; it already means something,
namely a trap.  The '020 doesn't fault, it just does two bus accesses.

> | strcpy(to, from) char *to, *from; { *to = *from; return (to); }
> | /*UNTESTED!*/

When tested with some simple test cases (even->even, odd->even, even->odd,
odd->odd), it worked.

> TO = a0 | I think this works

It does.

> | I forget if this is legal.  If not, copy to d0 first.
> btst #0,TO | test for odd destination

It's not legal, so I modified it to copy to d0 first.

> I wonder, though, if this is truly faster.  Should not a movb/bnes
> pair run in loop mode?

Nope.  Only "dbCC" loops run in loop mode.

> (Perhaps not; `dbcc' loops do, though, and one could use a dbra surrounded
> by a bit of extra logic.)

Yes.  The following, courtesy of John Gilmore and Vaughan Pratt, is what is
actually used in the (3.2 version of) "strcpy", etc.:

        moveq   #-1,d1          | maximum possible (16-bit) count
hardloop:
        movb    FROM@+,TO@+     | copy...
        dbeq    d1,hardloop     | until we copy a null or the count is -1
        bne     hardloop        | if not-null, continue copying with count
                                | freshly initialized to -1

Now for the numbers.  A test program was built to do a large number of
copies in a loop, and to do the same loop with no body; the times were
subtracted and the result was divided by the number of iterations.  The
program was run with strings of length 2, 10, and 100.
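
The timing method just described can be sketched in C roughly as follows.
This is not the original test program, which isn't shown here; the names
"byte_strcpy", "copy_fn", and "seconds_per_call" are made up for
illustration, and the use of "clock()" as the timer is an assumption.
"byte_strcpy" is only a C rendering of what the "movb"/"dbeq" loop computes
(copy bytes up to and including the null), not the library source.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical name: any copy routine with strcpy's calling convention. */
typedef char *(*copy_fn)(char *to, const char *from);

/* Byte-by-byte copy: copies up to and including the terminating null,
   which is what the movb/dbeq loop computes. */
char *byte_strcpy(char *to, const char *from)
{
    char *start = to;

    while ((*to++ = *from++) != '\0')
        ;
    return start;
}

/* Time `iterations` calls of `copy`, time the same loop with no body,
   subtract, and divide by the iteration count.  Returns seconds/call.
   Assumption: clock() is an adequate timer for this purpose. */
double seconds_per_call(copy_fn copy, long iterations,
                        const char *src, char *dst)
{
    volatile long i;    /* volatile, so the empty loop is not deleted */
    clock_t t0, t1, t2;

    assert(iterations > 0);

    t0 = clock();
    for (i = 0; i < iterations; i++)
        copy(dst, src);         /* call through a pointer, so the call is
                                   less likely to be inlined away */
    t1 = clock();
    for (i = 0; i < iterations; i++)
        ;                       /* same loop, no body */
    t2 = clock();

    return ((double)(t1 - t0) - (double)(t2 - t1))
        / CLOCKS_PER_SEC / (double)iterations;
}
```

Calling, say, seconds_per_call(byte_strcpy, 250000L, src, dst) with a
2-byte string in "src" corresponds to the first row of each table; passing
a different routine as the first argument times that routine under the
same conditions.  On a fast machine the difference can be lost in timer
noise, so the iteration counts would need scaling up.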
All strings were "malloc"ed, so they started on word boundaries (the program
tested this, just to make sure).  The results:

Byte-by-byte copy, using "movb"/"dbcc" loop (standard 3.2 "strcpy"), Sun-2
(10 MHz 68010, no caches, zero wait states):

250000 copies of 2 bytes took 5.720000 seconds
        0.000023 seconds/call
50000 copies of 10 bytes took 1.760000 seconds
        0.000035 seconds/call
5000 copies of 100 bytes took 0.860000 seconds
        0.000172 seconds/call

New strcpy, same Sun-2:

250000 copies of 2 bytes took 8.440000 seconds
        0.000034 seconds/call
50000 copies of 10 bytes took 3.120000 seconds
        0.000062 seconds/call
5000 copies of 100 bytes took 1.880000 seconds
        0.000376 seconds/call

Standard strcpy, Sun-3/75 (16.67 MHz 68020, no caches other than the on-chip
256-byte instruction cache, 1.5 wait states):

250000 copies of 2 bytes took 1.780000 seconds
        0.000007 seconds/call
50000 copies of 10 bytes took 0.720000 seconds
        0.000014 seconds/call
5000 copies of 100 bytes took 0.500000 seconds
        0.000100 seconds/call

New strcpy, same Sun-3/75:

250000 copies of 2 bytes took 2.800000 seconds
        0.000011 seconds/call
50000 copies of 10 bytes took 0.960000 seconds
        0.000019 seconds/call
5000 copies of 100 bytes took 0.520000 seconds
        0.000104 seconds/call

Standard strcpy, Sun-3/200 (25 MHz 68020, off-chip write-back cache, 0 wait
states):

250000 copies of 2 bytes took 1.060000 seconds
        0.000004 seconds/call
50000 copies of 10 bytes took 0.480000 seconds
        0.000010 seconds/call
5000 copies of 100 bytes took 0.260000 seconds
        0.000052 seconds/call

New strcpy, same Sun-3/200:

250000 copies of 2 bytes took 1.420000 seconds
        0.000006 seconds/call
50000 copies of 10 bytes took 0.520000 seconds
        0.000010 seconds/call
5000 copies of 100 bytes took 0.320000 seconds
        0.000064 seconds/call

These numbers were quite reproducible.  The moral(s) of the story:

1) Loop mode, on the 010, is a big win.  (The byte-by-byte "strcpy" runs in
   loop mode on the 010, the other one doesn't; the other one takes about
   twice as long.)

2) The instruction cache, on the 020, is a big win.  (The 020 versions
   don't differ by as much, and the other one seems to be catching up as
   the strings get longer, which didn't happen on the 010.)

3) With realistic string lengths, and 68K-family machines offered by Sun,
   at least, the plain vanilla byte-by-byte copy is the right way to do
   things, even with word-aligned strings.
-- 
        Guy Harris
        {ihnp4, decvax, seismo, decwrl, ...}!sun!guy
        guy@sun.com (or guy@sun.arpa)