Path: utzoo!utgpu!water!watmath!clyde!att!alberta!ubc-cs!uw-beaver!mit-eddie!rutgers!ucla-cs!admin.cognet.ucla.edu!casey
From: casey@admin.cognet.ucla.edu (Casey Leedom)
Newsgroups: comp.arch
Subject: Re: Explanation, please!
Message-ID: <15576@shemp.CS.UCLA.EDU>
Date: 28 Aug 88 23:38:52 GMT
References: <dpmuY#2EBC4R=eric@snark.UUCP> <28200192@urbsdc>
Sender: news@CS.UCLA.EDU
Reply-To: casey@admin.cognet.ucla.edu.UUCP (Casey Leedom)
Organization: none
Lines: 47

In article <28200192@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
| > > Duff's device
| > 
| > That's the most hackish way of trying to write a portable optimized
| > copy routine I've ever seen. I gather the whole point of the shenanigans
| > is to get all the *from++ -> *to++ instructions in the generated code to be
| > adjacent.
| 
| Well, that's a start. Duff's device does better on some machines (like,
| mine) that don't have autoincrement addressing modes, and even on some
| that do, if you use indexing instead of incrementing.  Unfortunately,
| that tends to make you run the copy backwards, which louses up some
| caches.
| 
| > This only makes if the author knows he's got a hardware instruction
| > pipeline or cache that's no less than 8 and no more than 9 byte-copy
| > instruction widths long, and stuff executing out of the pipeline is a lot
| > faster than if the copies are interleaved with control transfers. Dollars
| > to doughnuts this code was written on a RISC machine.

  I wrote the bcopy routine for 2.10BSD with some loop unrolling.  But
there the motivation wasn't pipelining, caching, or control/transfer
considerations.  Basically, it's very simple:

	1:
		mov	(r0)+, (r1)+	; 1.50 usec
		sob	r2, 1b		; 0.60 usec

	sob (subtract 1 and branch if non-zero) is 28% of total loop time.

vs:

	2:
		mov	(r0)+, (r1)+	; 1.50 usec
		mov	(r0)+, (r1)+	; 1.50 usec
		mov	(r0)+, (r1)+	; 1.50 usec
		mov	(r0)+, (r1)+	; 1.50 usec
		sob	r2, 2b		; 0.60 usec


	sob is 10% of total loop time.

  The other big performance win was to optimize odd to odd, even to even
with odd length, etc. copies which previous versions of bcopy implemented
as byte copy loops.

Casey