Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!rutgers!mcnc!rti!xyzzy!meissner
From: meissner@xyzzy.UUCP (Michael Meissner)
Newsgroups: comp.lang.c
Subject: Re: Portability of some overlapping strcpy or memcpy calls
Message-ID: <4068@xyzzy.UUCP>
Date: 13 Mar 89 15:37:22 GMT
References: <338@wjh12.harvard.edu>
Reply-To: meissner@tiktok.UUCP (Michael Meissner)
Organization: Data General (Languages @ Research Triangle Park, NC.)
Lines: 45

In article <338@wjh12.harvard.edu> kendall%saber@harvard.harvard.edu
(Samuel C. Kendall) writes:
| Consider the following function call:
|
|       memcpy(p, p + M, N)
|
| where p is a char*, M is nonnegative, N is positive, and M < N.  This
| is an overlapping copy, where the bytes are being copied to the left (M
| > 0) or onto themselves (M == 0).  I am interested in finding out if
| this call to memcpy, and similar calls to memccpy, strcpy, and strncpy,
| are portable.

Ok, if you want a real world example, consider systems based on the
Motorola 88000.  The chip has multiple functional units, pipelines,
and hardware interlocks.  When you access memory, there is a minimum
of 3 clock periods after the instruction starts before either the
register is loaded or memory is stored to.  Thus, it is better to do
multiple loads, followed by multiple stores to avoid stalling the
processor.  Thus the inner loop of memcpy would be something like:

loop:   ld      r5,r3,0         ; r5 <- *src
        ld      r6,r3,0x4       ; r6 <- *(src+4)
        ld      r7,r3,0x8       ; r7 <- *(src+8)
        st      r5,r2,0         ; store *src into *dest
        st      r6,r2,0x4       ; store *(src+4) into *(dest+4)
        st      r7,r2,0x8       ; store *(src+8) into *(dest+8)
        addu    r3,r3,0xc       ; bump src pointer
        subu    r4,r4,0xc       ; decrement length
        bcnd.n  ge0,r4,loop     ; loop back if more data to move
        addu    r2,r2,0xc       ; bump dest pointer (in delay slot)

Thus if M were 4 or 8, and word aligned moves were done, you would
lose, since the loads and stores are pipelined three deep.
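
For readers who want to try overlapping arguments against a loop of this shape, here is a rough sequential model of it in C. It is only a sketch under stated assumptions: the names copy3 and agrees_with_memmove are invented for illustration, a word is modeled as an unsigned int, the count must be a multiple of three words, and nothing about 88000 pipeline timing is modeled. It only captures the order "three loads, then three stores".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical C model of the quoted loop: each pass loads three words
   into temporaries (standing in for r5, r6, r7) and only then stores
   them.  Assumption: nwords is a multiple of 3. */
static void copy3(unsigned int *dest, const unsigned int *src, size_t nwords)
{
    size_t i;

    assert(nwords % 3 == 0);
    for (i = 0; i < nwords; i += 3) {
        unsigned int r5 = src[i];       /* ld r5,r3,0   */
        unsigned int r6 = src[i + 1];   /* ld r6,r3,0x4 */
        unsigned int r7 = src[i + 2];   /* ld r7,r3,0x8 */

        dest[i]     = r5;               /* st r5,r2,0   */
        dest[i + 1] = r6;               /* st r6,r2,0x4 */
        dest[i + 2] = r7;               /* st r7,r2,0x8 */
    }
}

/* Copies 12 words with a one-word overlap, once with copy3 and once
   with memmove, and returns 1 if both buffers end up identical.
   leftward != 0 means dest is one word below src (the M == 4 case). */
static int agrees_with_memmove(int leftward)
{
    unsigned int a[13], b[13];
    size_t i;

    for (i = 0; i < 13; i++)
        a[i] = b[i] = (unsigned int)i;

    if (leftward) {
        copy3(a, a + 1, 12);
        memmove(b, b + 1, 12 * sizeof b[0]);
    } else {
        copy3(a + 1, a, 12);
        memmove(b + 1, b, 12 * sizeof b[0]);
    }
    return memcmp(a, b, sizeof a) == 0;
}
```

In this sequential model, agrees_with_memmove(1) returns 1 and agrees_with_memmove(0) returns 0: the leftward overlap comes out intact because every load in a batch precedes its stores, while the rightward overlap re-reads words that an earlier batch already overwrote. Whether timing on real hardware changes the leftward case is outside what this model can show.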

I haven't looked at the library routine for memcpy recently, but I
know the authors did go out of their way to exploit the parallelism of
the machine.  The above code is roughly what the GNU 88k compiler
currently produces when it knows word alignment is valid, and that the
count is fixed.

I would expect even more striking results on machines with vector
units, since you should be able to make memcpy use the vector
instructions of the machine.
--
Michael Meissner, Data General.
Uucp:   ...!mcnc!rti!xyzzy!meissner
Arpa:   meissner@dg-rtp.DG.COM   (or) meissner%dg-rtp.DG.COM@relay.cs.net