Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!bloom-beacon!eru!hagbard!sunic!mcsun!ukc!edcastle!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: loop unrolling (was:Re: Register Count)
Message-ID: <PCG.91Jan16195742@teacho.cs.aber.ac.uk>
Date: 16 Jan 91 19:57:42 GMT
References: <PCG.91Jan10162301@odin.cs.aber.ac.uk> <11566@pt.cs.cmu.edu>
	<PCG.91Jan13174042@odin.cs.aber.ac.uk>
	<1991Jan14.215401.19522@jetsun.weitek.COM>
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 43
Nntp-Posting-Host: teacho
In-reply-to: gg@jetsun.weitek.COM's message of 14 Jan 91 21:54:01 GMT

On 14 Jan 91 21:54:01 GMT, gg@jetsun.weitek.COM said:

gg> In article <PCG.91Jan13174042@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
gg> (Piercarlo Grandi) writes:

pcg> If you have *some limited* degree of pipelining, as in contemporary
pcg> implementations, such as the classic three-four stage pipeline that
pcg> overlaps some computation with some control, and especially if this
pcg> pipeline is exposed with things like delayed branches, then
pcg> unrolling buys you nothing at all in time, and loses code space.

gg> On the contrary: it can give you bigger basic blocks in the critical
gg> loops, thus making more room for instruction scheduling to minimize
gg> delays.

Sorry, we are still on different planets: if this is of benefit, it is
because the internal degree of parallelism is higher than what I
assumed above. There is no need for instruction scheduling across loop
iterations if all the pipeline stages are kept busy by a single
iteration of the loop.

Unrolling is of benefit only if there are enough functional units/stages
of the pipe that a single iteration does not keep all of them busy. In
most contemporary microprocessor implementations, which have three-four
stage pipelines, the computation in a single loop iteration PLUS the
control of the next iteration normally keeps all functional units busy.
If your implementation has greater internal parallelism, and your
application can take advantage of it, more power to you. If not, check
your assumptions, man :-).
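
To pin down what transformation is being argued about, here is a minimal
sketch in C, assuming a plain summation loop. The function names
sum_rolled and sum_unrolled4 are made up for illustration, and the
factor of four is arbitrary. The unrolled form has a bigger basic block
and one branch per four elements; whether that gains any time depends on
whether the hardware has idle units or stages left to fill, and it
always costs code space.

```c
#include <stddef.h>

/* Rolled form: one add plus one loop-control test and branch per element. */
long sum_rolled(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four (illustrative factor): one branch per four elements.
   The four independent accumulators give a scheduler work to overlap,
   which only matters if the machine has spare units or stages. */
long sum_unrolled4(const long *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* remainder when n is not a multiple of 4 */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum for any n; only the shape of the loop
body differs, which is the whole point of contention.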


At times this discussion reminds me of old discussions on the optimal
degree of multiprogramming, which is limited by (number of CPUs)+(number
of possibly outstanding IO operations) or something like that. For most
Unix machines of the PDP/VAX/SUN class this limited the optimal degree
of multiprogramming to something like 2-4 (the Law of Four).
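
Taking that rule of thumb at face value, the arithmetic can be sketched
as below. The function name max_useful_mpl is hypothetical, and the
sample figures in the usage note are my assumptions, not measurements.

```c
/* Rough upper bound on the useful degree of multiprogramming, per the
   rule of thumb: (number of CPUs) + (number of IO operations that can
   be outstanding at once). Illustrative only. */
unsigned max_useful_mpl(unsigned cpus, unsigned outstanding_io)
{
    return cpus + outstanding_io;
}
```

For example, assuming one CPU and between one and three IO operations in
flight, max_useful_mpl(1, 1) is 2 and max_useful_mpl(1, 3) is 4, which
is the 2-4 range.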

I got the same type of reactions (no, it is 16 on my iPSC; no, it is
larger if you use a more parallelizing scheduler, ...). Bah! Pah!
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk