Path: utzoo!attcan!uunet!samsung!zaphod.mps.ohio-state.edu!mips!lloyd!cprice
From: cprice@mips.COM (Charlie Price)
Newsgroups: comp.arch
Subject: Re: icache size (was Compilers taking advantage of architectural enhancements)
Message-ID: <42415@mips.mips.COM>
Date: 26 Oct 90 07:31:36 GMT
References: <3300194@m.cs.uiuc.edu> <1990Oct12.042251.18884@cs.cmu.edu> <1990Oct12.135814.14346@zip.eecs.umich.edu>
Sender: news@mips.COM
Reply-To: cprice@mips.mips.COM (Charlie Price)
Organization: MIPS Computer Systems
Lines: 63

In article <1990Oct12.135814.14346@zip.eecs.umich.edu> billms@zip.eecs.umich.edu (Bill Mangione-Smith) writes:
>
>I don't understand why people are still consumed by code size in with (most)
>aggressive loop optimizations.  Loop unrolling and polycyclic scheduling
>do increase code size.  That just isn't important anymore.  Give me a 4k
>icache.  Thats usually 1k instructions, right?  I've been using the
>Astronautics ZS-1, which almost always unrolls loops and does a very good
>job of picking the 'correct' unrolling depth.  Yet the loops I've looked
>at are almost never expanded to over 100 instructions, let alone 1k.
>
>When you are unrolling loops, you only need a certain number of instructions
>(dependant on fu and mem latencies) to work with.  Even a small icache,
>i.e. 1-4k, can hold the required number of instructions.
>
>Granted, this issue might still be important for modern cpus that still
>have very small icaches, but they are quickly being replaced.

I'll take a shot at this since I haven't seen anybody else try yet...

There ain't no such thing as a free lunch.

Unrolling a loop makes things faster by amortizing the per-iteration
overhead, that is, the loop-test and bookkeeping instruction(s) (like adds
to addresses) and the conditional branch, over more "useful" instructions.

There are overhead costs for loop unrolling due to I-cache considerations,
and these *can* make things slower -- or at least change what counts as
a good unrolling limit.

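
To make the amortization argument concrete, here is a rough Python sketch
that just counts instructions; it is a toy model, not a measurement of any
real machine. The function names and the example numbers (4 "useful"
instructions and 2 overhead instructions per pass) are assumptions chosen
for illustration, and the model assumes the overhead per pass stays
constant no matter how far the loop is unrolled.

```python
def dynamic_instructions(trip_count, body, overhead, unroll):
    """Instructions executed for `trip_count` original iterations.

    body     -- "useful" instructions per original iteration (assumed)
    overhead -- loop test, address adds, branch per pass (assumed constant)
    unroll   -- unrolling factor; leftover iterations run in a rolled
                cleanup loop
    """
    main_passes, leftover = divmod(trip_count, unroll)
    main = main_passes * (unroll * body + overhead)
    cleanup = leftover * (body + overhead)
    return main + cleanup


def static_instructions(body, overhead, unroll):
    """Code size in instructions: the unrolled loop plus, when the loop
    is actually unrolled, a rolled cleanup loop for the leftovers."""
    main = unroll * body + overhead
    cleanup = (body + overhead) if unroll > 1 else 0
    return main + cleanup


if __name__ == "__main__":
    for u in (1, 4, 8):
        print(u,
              dynamic_instructions(100, 4, 2, u),
              static_instructions(4, 2, u))
```

With these made-up numbers, going from no unrolling to 8x unrolling cuts
the executed instructions for 100 iterations from 600 to 432, while the
code that has to be fetched into the I-cache grows from 6 to 40
instructions.
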
Consider the following:

1) It costs some overhead to get instructions into cache.
   The larger the code loop, the higher the pure overhead cost of
   fetching the instructions into the I-cache.
   As the processor becomes faster, the overhead per instruction
   almost certainly becomes relatively larger.
   The loop unrolling has to pay this direct overhead cost back
   before it even breaks even -- but you *guarantee* that you
   run through an unrolled loop fewer times the more it is unrolled.
   For the fast processor (with the smaller cache?) it takes
   more loop traversals just to break even.

2) A more-unrolled loop, being bigger, displaces more instructions
   from the I-cache than a less-unrolled loop.
   Some of those instructions were useful!
   That means that in addition to incurring a direct overhead cost
   by being fetched in the first place, an unrolled loop incurs
   an indirect cost by forcing the processor to re-fetch some
   instructions that would otherwise have stayed in cache.
   The savings needed to break even go up still more.

3) Additional "real" memory traffic is nearly always bad in some limit.
   It makes easy-to-program machines like bus-based shared-memory
   machines poop out at a lower number of processors.

Nothing seems to scale very well: loop unrolling that works nicely
on one machine may not be a win on something either 5 times faster
or 5 times slower.

Loop unrolling isn't necessarily as big a win as it might seem
for all situations.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave.  / Sunnyvale, CA   94086-23650