Xref: utzoo comp.lang.c:12173 comp.arch:6185
Newsgroups: comp.lang.c,comp.arch
Path: utzoo!henry
From: henry@utzoo.uucp (Henry Spencer)
Subject: Re: Explanation, please!
Message-ID: <1988Aug29.002101.213@utzoo.uucp>
Organization: U of Toronto Zoology
References: <638@paris.ics.uci.edu> <dpmuY#2EBC4R=eric@snark.UUCP>
Date: Mon, 29 Aug 88 00:21:01 GMT

In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes:
>This only makes sense if the author knows he's got a hardware instruction pipeline
>or cache that's no less than 8 and no more than 9 byte-copy instruction widths
>long, and stuff executing out of the pipeline is a lot faster than if the
>copies are interleaved with control transfers. Dollars to doughnuts this code
>was written on a RISC machine.

Nope.  Bell Labs Research uses VAXen and 68Ks, mostly.

The key point is not pipelining, but loop-control overhead.  There is in
fact a tradeoff here:  unrolling the loop further will reduce control
overhead further, but will increase code size.  Code size is of some
significance when caching gets into the act:  cache-loading overhead
favors short loops, and a small cache very strongly favors them.
In general there is an optimal point in there somewhere, and an unrolling
factor of 8 or 16 is a pretty good guess at it on the machines I've looked
at closely.
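
To make the tradeoff concrete, here is a rough sketch of a byte copy done
both ways.  The function names, the argument types, and the choice of 8
are mine for illustration; this is not the code under discussion, just a
plain unrolled loop with a cleanup loop for the leftover bytes.

```c
#include <stddef.h>

/* Straightforward copy: one decrement, test, and branch per byte moved. */
void copy_simple(char *to, const char *from, size_t count)
{
    while (count-- > 0)
        *to++ = *from++;
}

/*
 * Copy unrolled by a factor of 8 (an illustrative choice): the
 * decrement, test, and branch are paid once per 8 bytes, at the
 * price of a loop body roughly 8 times the size.
 */
void copy_unrolled8(char *to, const char *from, size_t count)
{
    size_t blocks = count / 8;  /* number of full 8-byte passes */
    size_t rest = count % 8;    /* leftover bytes, 0 through 7 */

    while (blocks-- > 0) {
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
    }

    /* Cleanup: the leftovers go through the ordinary loop. */
    while (rest-- > 0)
        *to++ = *from++;
}
```

For a 27-byte copy, copy_simple goes around its loop 27 times, while
copy_unrolled8 makes 3 passes through the big loop and 3 through the
cleanup loop.  Whether that wins in practice depends on the machine and
its cache, per the above, so measure before believing it.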
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu