Xref: utzoo comp.lang.c:12173 comp.arch:6185
Newsgroups: comp.lang.c,comp.arch
Path: utzoo!henry
From: henry@utzoo.uucp (Henry Spencer)
Subject: Re: Explanation, please!
Message-ID: <1988Aug29.002101.213@utzoo.uucp>
Organization: U of Toronto Zoology
References: <638@paris.ics.uci.edu>
Date: Mon, 29 Aug 88 00:21:01 GMT

In article eric@snark.UUCP (Eric S. Raymond) writes:
>This only makes sense if the author knows he's got a hardware instruction pipeline
>or cache that's no less than 8 and no more than 9 byte-copy instruction widths
>long, and stuff executing out of the pipeline is a lot faster than if the
>copies are interleaved with control transfers. Dollars to doughnuts this code
>was written on a RISC machine.

Nope. Bell Labs Research uses VAXen and 68Ks, mostly.

The key point is not pipelining, but loop-control overhead. There is in
fact a tradeoff here: unrolling the loop further will reduce control
overhead further, but will increase code size. That last is of some
significance when caching gets into the act: cache-loading overhead
favors short loops, and small cache sizes very strongly favor short ones.
In general there is an optimal point in there somewhere, and an unrolling
factor of 8 or 16 is a pretty good guess at it on the machines I've looked
at closely.
-- 
Intel CPUs are not defective,  | Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
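
To make the loop-control tradeoff above concrete, here is a minimal sketch in C. It is an illustration only, not the code under discussion in the thread. The function names copy_simple and copy_unrolled8 are made up for this example, and the factor of 8 is just one of the two guesses mentioned above. The simple version pays one loop test and branch per byte; the unrolled version pays one per 8 bytes, at the cost of a loop body roughly 8 times as large.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward copy: one loop test and branch for every byte. */
void copy_simple(char *dst, const char *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    while (n-- > 0)
        *dst++ = *src++;
}

/*
 * Same copy, unrolled by 8 (hypothetical factor chosen for illustration).
 * The main loop does one test and branch per 8 bytes copied; the body is
 * about 8 times larger, which is the code-size side of the tradeoff.
 */
void copy_unrolled8(char *dst, const char *src, size_t n)
{
    size_t blocks = n / 8;  /* number of full 8-byte groups */
    size_t rest = n % 8;    /* leftover bytes, 0 through 7 */

    assert(n == 0 || (dst != NULL && src != NULL));

    while (blocks-- > 0) {
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
    }

    /* Copy the remainder one byte at a time. */
    while (rest-- > 0)
        *dst++ = *src++;
}
```

Both functions produce the same result for any n, for example copy_unrolled8(buf, "hello", 6). Whether the unrolled form is actually faster depends on the machine and its cache, so the factor is best treated as something to measure rather than assume.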