Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!amdcad!ames!sdcsvax!ucbvax!utzoo.UUCP!henry
From: henry@utzoo.UUCP
Newsgroups: comp.protocols.tcp-ip
Subject: Re: TCP performance limitations
Message-ID: <8710230618.AA23418@ucbvax.Berkeley.EDU>
Date: Fri, 23-Oct-87 01:43:18 EST
Article-I.D.: ucbvax.8710230618.AA23418
Posted: Fri Oct 23 01:43:18 1987
Date-Received: Sun, 25-Oct-87 13:38:40 EST
References: <1218@nrcvax.UUCP> <871013100846.2.DCP@KOYAANISQATSI.S4CC.Symbolics.COM>
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 19

> ... Fourth, depending on your CPU architecture, there
> may be ideal unrolling constants which would keep the unrolled loop
> inside an instruction prefetch buffer; complete unrolling would actually
> be a degradation.

In particular, almost any CPU with a cache -- which means most anything above
the PC level nowadays -- will have an optimum degree of unrolling for loops
that iterate a given number of times.  It's not just a question of whether
the loop will fit; eventually the extra main-memory fetches needed to get
a larger loop into the cache wipe out the gains from reduced loop-control
overhead.  For straightforward caches (with a loop that will *fit* in the
cache!), elapsed time versus degree of unrolling is a nice smooth curve with
a quite marked minimum.  Based on the look I took at this, if the ratio of
your cache speed to memory speed isn't striking, and your loop control is
not grossly costly (due to e.g. pipeline breaks), the minimum has a good
chance of falling at a fairly modest unrolling factor, maybe 8 or 16.
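
A rough way to see where such a minimum comes from is a toy cost model.
The sketch below is not the measurement mentioned above; it is an
illustration under assumed numbers.  The function names and every cost
parameter (cycles per body, cycles of loop control, instructions per
unrolled copy, miss penalty) are hypothetical, and the model ignores the
cleanup code needed when the trip count is not a multiple of the unrolling
factor.  It charges loop control once per trip through the unrolled loop,
and one main-memory fetch per instruction to get the loop into the cache.

```python
import math


def elapsed_cycles(unroll, iterations, body=4.0, control=3.0,
                   instrs_per_copy=4, miss_penalty=10.0):
    """Estimated cycles for a loop unrolled `unroll` times (toy model).

    All cost parameters are illustrative assumptions, not measurements.
    """
    trips = math.ceil(iterations / unroll)   # passes through the unrolled loop
    work = iterations * body                 # useful work, independent of unrolling
    overhead = trips * control               # loop control, paid once per trip
    # Cold-cache cost: each instruction of the unrolled loop is fetched
    # from main memory once before it can run out of the cache.
    fill = unroll * instrs_per_copy * miss_penalty
    return work + overhead + fill


def best_unroll(iterations, candidates=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Return the candidate unrolling factor with the lowest estimated time."""
    return min(candidates, key=lambda u: elapsed_cycles(u, iterations))


if __name__ == "__main__":
    for u in (1, 2, 4, 8, 16, 32, 64, 128):
        print(u, elapsed_cycles(u, 1024))
    print("best:", best_unroll(1024))
```

With these assumed numbers and 1024 iterations, loop-control cost falls as
1/unroll while cache-fill cost rises linearly with it, so the total bottoms
out at a factor of 8.  A higher miss penalty or cheaper loop control pushes
the minimum lower; a longer-running loop pushes it higher.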

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry