Path: utzoo!utgpu!water!watmath!clyde!att!alberta!ubc-cs!uw-beaver!mit-eddie!rutgers!ucla-cs!admin.cognet.ucla.edu!casey From: casey@admin.cognet.ucla.edu (Casey Leedom) Newsgroups: comp.arch Subject: Re: Explanation, please! Message-ID: <15576@shemp.CS.UCLA.EDU> Date: 28 Aug 88 23:38:52 GMT References: <28200192@urbsdc> Sender: news@CS.UCLA.EDU Reply-To: casey@admin.cognet.ucla.edu.UUCP (Casey Leedom) Organization: none Lines: 47 In article <28200192@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes: | > > Duff's device | > | > That's the most hackish way of trying to write a portable optimized | > copy routine I've ever seen. I gather the whole point of the shenanigans | > is to get all the *from++ -> *to++ instructions in the generated code to be | > adjacent. | | Well, that's a start. Duff's device does better on some machines (like, | mine) that don't have autoincrement addressing modes, and even on some | that do, if you use indexing instead of incrementing. Unfortunately, | that tends to make you run the copy backwards, which louses up some | caches. | | > This only makes if the author knows he's got a hardware instruction | > pipeline or cache that's no less than 8 and no more than 9 byte-copy | > instruction widths long, and stuff executing out of the pipeline is a lot | > faster than if the copies are interleaved with control transfers. Dollars | > to doughnuts this code was written on a RISC machine. I wrote the bcopy routine for 2.10BSD with some loop unrolling. But there the motivation wasn't pipelining, caching, or control/transfer considerations. Basically, it's very simple: 1: mov (r0)+, (r1)+ ; 1.50 usec sob r2, 1b ; 0.60 usec sob (subtract 1 and branch if non-zero) is 28% of total loop time. vs: 2: mov (r0)+, (r1)+ ; 1.50 usec mov (r0)+, (r1)+ ; 1.50 usec mov (r0)+, (r1)+ ; 1.50 usec mov (r0)+, (r1)+ ; 1.50 usec sob r2, 2b ; 0.60 usec sob is 10% of total loop time. The other big performance win was to optimize odd to odd, even to even with odd length, etc. copies which previous versions of bcopy implemented as byte copy loops. Casey