Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rochester!pt.cs.cmu.edu!sei!sei.cmu.edu!firth
From: firth@sei.cmu.edu (Robert Firth)
Newsgroups: comp.arch,comp.lang.c
Subject: Re: String Handling ( really fixed-length copy ).
Message-ID: <1038@aw.sei.cmu.edu>
Date: Mon, 20-Apr-87 08:52:59 EST
Article-I.D.: aw.1038
Posted: Mon Apr 20 08:52:59 1987
Date-Received: Tue, 21-Apr-87 00:35:35 EST
References: <15292@amdcad.UUCP> <7897@utzoo.UUCP> <4558@utcsri.UUCP>
Sender: netnews@sei.cmu.edu
Reply-To: firth@bd.sei.cmu.edu.UUCP (PUT YOUR NAME HERE)
Organization: Carnegie-Mellon University, SEI, Pgh, Pa
Lines: 31
Xref: mnetor comp.arch:1013 comp.lang.c:1793

In article <4558@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
>This string op-stuff gave me an idea. A run-time library could contain
>a function called 'mov200words' looking like this :
>
>mov200words:	mov	(a0)+,(a1)+
>		mov	(a0)+,(a1)+
>		..... 200 mov's in all
>		mov	(a0)+,(a1)+
>		rts
>
>Then, if, say, a 64-word struct needed to be copied, the compiler would get
>the pointers and then call mov200words+(200-64)*2 [ or whatever ] to do the
>copy. This would provide unrolled-loop speed with only one loop unrolled in
>the whole executable. [ Call it more than once for >200 words ]. Presumably
>this would be faster than a loop on a PDP-11 or a 68000, but might lose on a
>machine with an instruction cache, that could run a copy loop on-chip. A wizzo
>block copy instruction may or may not run faster than the unrolled loop.

Great idea, Gregory! I saw this implemented in a Pascal/PDP-11 compiler,
and it fascinated me then.

Whether it's faster or slower than a block move or a loop depends on the
machine. For instance, on a RISCy machine with a separate I-bus, the
limiting factor is data accesses anyway, so everything takes about the
same time. On the Vax-11/780, the MOVC3 seems to take almost the same
time as the equivalent number of MOVLs, and rather more time than MOVQs.
Since it also destroys 6 registers, it should be avoided.

Of course, for small structures you generate the sequence inline; on a
Vax maybe 7 or 8 MOVQs is OK, after that better go to the subroutine
(called by JSR of course).

Has anyone published statistics on the size distribution of Pascal
arrays & records?
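
For comp.lang.c readers, the entry-point trick in the quoted article can be approximated in portable C. This is only a rough sketch under stated assumptions: C cannot call an address inside a function, so a switch with fall-through stands in for "call mov200words+(200-64)*2". The names mov_unrolled and copy_words, the 16-bit word type, and the unroll count of 8 (instead of 200) are all hypothetical choices made for brevity, not anything taken from the compilers discussed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unroll count; the quoted article uses 200. */
enum { UNROLL = 8 };

/*
 * Copy n words, 0 <= n <= UNROLL, by entering a straight-line
 * sequence of moves part-way through. The case labels must match
 * UNROLL. Each case falls through to the next, so entering at
 * "case n" executes exactly n moves.
 */
static void mov_unrolled(uint16_t *dst, const uint16_t *src, size_t n)
{
    switch (n) {
    case 8: *dst++ = *src++; /* fall through */
    case 7: *dst++ = *src++; /* fall through */
    case 6: *dst++ = *src++; /* fall through */
    case 5: *dst++ = *src++; /* fall through */
    case 4: *dst++ = *src++; /* fall through */
    case 3: *dst++ = *src++; /* fall through */
    case 2: *dst++ = *src++; /* fall through */
    case 1: *dst++ = *src++; /* fall through */
    case 0: break;
    }
}

/*
 * Copy any number of words: call the full sequence repeatedly,
 * then enter it part-way for the remainder ("call it more than
 * once" for copies longer than the unrolled sequence).
 */
void copy_words(uint16_t *dst, const uint16_t *src, size_t n)
{
    while (n > UNROLL) {
        mov_unrolled(dst, src, UNROLL);
        dst += UNROLL;
        src += UNROLL;
        n -= UNROLL;
    }
    mov_unrolled(dst, src, n);
}
```

A compiler using this scheme would emit the equivalent of copy_words(dst, src, 64) for a 64-word struct. Whether the switch compiles to a computed jump as cheap as the hand-written entry point depends on the compiler and target, so any speed claim would need measuring on the machine in question.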