Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!bloom-beacon!eru!hagbard!sunic!mcsun!ukc!edcastle!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: loop unrolling (was:Re: Register Count)
Message-ID:
Date: 16 Jan 91 19:57:42 GMT
References: <11566@pt.cs.cmu.edu> <1991Jan14.215401.19522@jetsun.weitek.COM>
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 43
Nntp-Posting-Host: teacho
In-reply-to: gg@jetsun.weitek.COM's message of 14 Jan 91 21:54:01 GMT

On 14 Jan 91 21:54:01 GMT, gg@jetsun.weitek.COM said:

gg> In article pcg@cs.aber.ac.uk
gg> (Piercarlo Grandi) writes:

pcg> If you have *some limited* degree of pipelining, as in contemporary
pcg> implementations, such as the classic three-four stage pipeline that
pcg> overlaps some computation with some control, and especially if this
pcg> pipeline is exposed with things like delayed branches, then
pcg> unrolling buys you nothing at all in time, and loses code space.

gg> On the contrary: it can give you bigger basic blocks in the critical
gg> loops, thus making more room for instruction scheduling to minimize
gg> delays.

Sorry, we are still on different planets: if this is of benefit it is
because the internal degree of parallelism is higher than what I have
assumed above. There is no need for instruction scheduling across loop
iterations if all the pipeline stages are kept busy with a single
iteration of the loop. Unrolling is of benefit only if there are enough
functional units/stages in the pipe that a single iteration does *not*
keep all of them busy.

In most contemporary microprocessor implementations that have
three-four stage pipelines, normally the computation in a single loop
iteration PLUS the control of the next iteration keeps all functional
units busy.

If your implementation has greater internal parallelism, and your
application can take advantage of it, more power to you.
If not, check your assumptions, man :-).

At times this discussion reminds me of old discussions on the optimal
degree of multiprogramming, which is limited by
(number of CPUs) + (number of possibly outstanding IO operations) or
something like that. For most Unix machines of the PDP/VAX/SUN class
this made the optimal degree of multiprogramming something like 2-4
(the Law of Four). I got the same type of reactions (no, it is 16 on my
iPSC; no, it is larger if you use a more parallelizing scheduler, ...).
Bah! Pah!
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
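
[Illustration, not part of the original article.] For readers who want
to see concretely what the unrolling argument is about, here is a
minimal C sketch of the two loop shapes being compared. Everything in
it is an assumption made for illustration: the function names
(sum_rolled, sum_unrolled4, check_sums_agree), the summation workload,
and the unroll factor of 4 do not come from the thread. It shows only
the transformation, and makes no claim about which form runs faster;
that depends on how much internal parallelism the implementation has.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one add per iteration, one branch per iteration. */
long sum_rolled(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 (the factor is an arbitrary choice for this sketch).
 * The loop body is a larger basic block with four independent
 * accumulators, which gives a scheduler more instructions to reorder,
 * at the cost of more code and more registers. */
long sum_unrolled4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main unrolled loop: handles groups of four elements. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Cleanup loop: handles the 0..3 leftover elements. */
    for (; i < n; i++)
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}

/* Hypothetical helper: checks that both forms compute the same sum. */
void check_sums_agree(const int *a, size_t n)
{
    assert(sum_rolled(a, n) == sum_unrolled4(a, n));
}
```

If a single iteration of sum_rolled already keeps every stage busy, the
larger body of sum_unrolled4 has no idle slots to fill and only costs
code space; the extra scheduling room matters only when stages would
otherwise sit idle.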