Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!zaphod.mps.ohio-state.edu!rpi!batcomputer!munnari.oz.au!labtam!graeme
From: graeme@labtam.labtam.oz (Graeme Gill)
Newsgroups: comp.arch
Subject: Re: loop unrolling (was:Re: Register Count)
Summary: Assuming simple loop tests
Message-ID: <5869@labtam.labtam.oz>
Date: 18 Jan 91 02:28:35 GMT
References: <11566@pt.cs.cmu.edu>
Organization: Labtam Australia, Melbourne, Australia
Lines: 37

In article , pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>
> Unrolling only is of benefit if there are enough functional units/stages
> of the pipe that a single iteration does keep all stages busy. In most
> contemporary microprocessor implementations that have three-four stage
> pipelines, normally the computation in a single loop iteration PLUS the
> control of the next iteration keeps all functional units busy. If your
> implementation has greater internal parallelism, and your application
> can take advantage of it, more power to you. If not, check your
> assumptions, man :-).
>

There seems to be an assumption here that the loop conditions are very
simple, i.e. decrement a loop counter, jump if non-zero, etc. Many real
problems involve loops with much more complicated tests. The tests may
involve significant processing in themselves, and in this situation loop
unrolling is the only way of reducing this overhead significantly. Delay
slots etc. just don't give you enough to cover this up.

I think it is misleading to consider only a subset of typical problems.
Real-world code involves more than simple loops of logical or arithmetic
processing of integer-sized elements. Some problems involve data movement
and processing of elements larger than integers.

Talk of 4 registers being sufficient for 'typical' problems is a joke in
many cases. For instance, copying a pattern into memory will double in
speed if the pattern can be cached in registers.
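
As a rough sketch of the fill-versus-copy distinction, assuming a
pattern of four 32-bit words (the function names, the pattern size and
the word type are made up for illustration, and whether the locals
really end up in registers depends on the compiler and the target):

```c
#include <stddef.h>
#include <stdint.h>

/* "Fill speed": the 4-word pattern is copied into locals that a compiler
 * can keep in registers, so the main loop issues stores only.
 * fill_pattern4 is a hypothetical name, not from any real library. */
void fill_pattern4(uint32_t *dst, size_t n, const uint32_t pat[4])
{
    uint32_t p0 = pat[0], p1 = pat[1], p2 = pat[2], p3 = pat[3];
    size_t i = 0;
    size_t k;

    /* One trip of this loop writes a whole pattern repeat. */
    for (; i + 4 <= n; i += 4) {
        dst[i]     = p0;
        dst[i + 1] = p1;
        dst[i + 2] = p2;
        dst[i + 3] = p3;
    }

    /* Leftover words when n is not a multiple of 4. */
    for (k = 0; i < n; i++, k++)
        dst[i] = pat[k];
}

/* "Copy speed": a pattern of arbitrary length stays in memory, so every
 * word costs a load as well as a store, plus the bookkeeping variable j
 * that tracks the pattern repeat boundary. */
void fill_pattern_mem(uint32_t *dst, size_t n,
                      const uint32_t *pat, size_t plen)
{
    size_t i, j = 0;

    if (plen == 0)
        return;                 /* nothing to repeat */

    for (i = 0; i < n; i++) {
        dst[i] = pat[j];
        if (++j == plen)        /* wrap at the repeat boundary */
            j = 0;
    }
}
```

Both routines produce the same memory contents for a 4-word pattern; the
only difference is where the pattern lives while the loop runs.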
A pattern may be of arbitrary length, and once its size exceeds the
register resources, the performance of the code must drop to copy speed
rather than fill speed. Loops involving modulus arithmetic to keep track
of pattern repeat boundaries will involve several variables in
themselves, and even a 10-20% reduction in code performance because the
next-to-inner loop count variables are not in registers is unacceptable
in a competitive situation.

For some problems, even with a wonderful compiler, 32 registers is
rarely enough.

Graeme Gill
Electronic Design Engineer
Labtam Australia