Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rochester!pt.cs.cmu.edu!sei!sei.cmu.edu!firth
From: firth@sei.cmu.edu (Robert Firth)
Newsgroups: comp.arch,comp.lang.c
Subject: Re: String Handling ( really fixed-length copy ).
Message-ID: <1038@aw.sei.cmu.edu>
Date: Mon, 20-Apr-87 08:52:59 EST
Article-I.D.: aw.1038
Posted: Mon Apr 20 08:52:59 1987
Date-Received: Tue, 21-Apr-87 00:35:35 EST
References: <15292@amdcad.UUCP> <7897@utzoo.UUCP> <4558@utcsri.UUCP>
Sender: netnews@sei.cmu.edu
Reply-To: firth@bd.sei.cmu.edu.UUCP (PUT YOUR NAME HERE)
Organization: Carnegie-Mellon University, SEI, Pgh, Pa
Lines: 31
Xref: mnetor comp.arch:1013 comp.lang.c:1793

In article <4558@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
>This string op-stuff gave me an idea. A run-time library could contain
>a function called 'mov200words' looking like this :
>
>mov200words:	mov	(a0)+,(a1)+
>		mov	(a0)+,(a1)+
>		.....	200 mov's in all
>		mov	(a0)+,(a1)+
>		rts
>
>Then, if, say, a 64-word struct needed to be copied, the compiler would get
>the pointers and then call mov200words+(200-64)*2 [ or whatever ] to do the
>copy. This would provide unrolled-loop speed with only one loop unrolled in
>the whole executable. [ Call it more than once for >200 words ].  Presumably
>this would be faster than a loop on a PDP-11 or a 68000, but might lose on a
>machine with an instruction cache, that could run a copy loop on-chip. A wizzo
>block copy instruction may or may not run faster than the unrolled loop.

Great idea, Gregory!  I saw this implemented in a Pascal/PDP-11 compiler,
and it fascinated me then.  Whether it's faster or slower than a block
move or a loop depends on the machine.  For instance, on a RISCy machine
with a separate I-bus, data accesses are the limiting factor anyway, so
everything takes about the same time.  On the VAX-11/780, MOVC3 seems
to take almost the same time as the equivalent number of MOVLs, and rather
more time than the equivalent MOVQs.  Since it also destroys 6 registers,
it should be avoided.
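
For readers who want to see the trick in C rather than assembler, here is
a rough sketch.  C cannot jump to "routine + offset", so a switch with
fall-through stands in for the computed entry point.  The names UNROLL,
mov_upto_unroll and copy_words are made up for illustration, and the depth
of 8 is an assumption chosen only to keep the listing short (the quoted
routine uses 200).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unroll depth; the quoted routine uses 200 moves. */
#define UNROLL 8

/* Copy n words (0 <= n <= UNROLL) by entering a straight-line run of
   moves part-way down, so that exactly n moves execute.  This mimics
   calling mov200words at an offset. */
static void mov_upto_unroll(const int *src, int *dst, size_t n)
{
    assert(n <= UNROLL);
    switch (n) {
    case 8: *dst++ = *src++; /* fall through */
    case 7: *dst++ = *src++; /* fall through */
    case 6: *dst++ = *src++; /* fall through */
    case 5: *dst++ = *src++; /* fall through */
    case 4: *dst++ = *src++; /* fall through */
    case 3: *dst++ = *src++; /* fall through */
    case 2: *dst++ = *src++; /* fall through */
    case 1: *dst++ = *src++; /* fall through */
    case 0: break;
    }
}

/* Copies longer than UNROLL words call the unrolled run more than
   once, as the quoted article suggests for more than 200 words. */
void copy_words(const int *src, int *dst, size_t n)
{
    while (n > UNROLL) {
        mov_upto_unroll(src, dst, UNROLL);
        src += UNROLL;
        dst += UNROLL;
        n -= UNROLL;
    }
    mov_upto_unroll(src, dst, n);
}
```

A 19-word copy, for example, runs the full sequence twice and then enters
it three moves from the end.  Whether a compiler turns the switch into
anything as cheap as a direct jump into the sequence is, again, machine
and compiler dependent.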

Of course, for small structures you generate the sequence inline; on a VAX
maybe 7 or 8 MOVQs is OK; beyond that, better to go to the subroutine
(called by JSB, of course).  Has anyone published statistics on the size
distribution of Pascal arrays & records?
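
The inline-versus-subroutine choice might look something like this inside
a code generator.  It is only a sketch: the names copy_strategy and
choose_copy are hypothetical, and the cutoff of 8 quadword moves is simply
the upper end of the "7 or 8" guess above, not a measured figure.

```c
#include <stddef.h>

enum copy_strategy { COPY_INLINE, COPY_SUBROUTINE };

/* Assumed cutoff: at most 8 quadword (8-byte) moves emitted inline. */
#define INLINE_MAX_QUADWORDS 8

/* Pick a strategy for copying a structure of nbytes bytes. */
enum copy_strategy choose_copy(size_t nbytes)
{
    /* Round up to whole quadwords without risking overflow. */
    size_t quads = nbytes / 8 + (nbytes % 8 != 0);

    return quads <= INLINE_MAX_QUADWORDS ? COPY_INLINE : COPY_SUBROUTINE;
}
```

With this cutoff a 64-byte record is copied inline and a 65-byte one goes
to the subroutine.  The right number would have to come from exactly the
kind of size statistics asked about above.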