Path: utzoo!attcan!uunet!samsung!zaphod.mps.ohio-state.edu!mips!lloyd!cprice
From: cprice@mips.COM (Charlie Price)
Newsgroups: comp.arch
Subject: Re: icache size (was Compilers taking advantage of architectural enhancements)
Message-ID: <42415@mips.mips.COM>
Date: 26 Oct 90 07:31:36 GMT
References: <3300194@m.cs.uiuc.edu> <AGLEW.90Oct11222801@treflan.crhc.uiuc.edu> <1990Oct12.042251.18884@cs.cmu.edu> <1990Oct12.135814.14346@zip.eecs.umich.edu>
Sender: news@mips.COM
Reply-To: cprice@mips.mips.COM (Charlie Price)
Organization: MIPS Computer Systems
Lines: 63

In article <1990Oct12.135814.14346@zip.eecs.umich.edu> billms@zip.eecs.umich.edu (Bill Mangione-Smith) writes:
>
>I don't understand why people are still consumed by code size in with (most)
>aggressive loop optimizations.  Loop unrolling and polycyclic scheduling
>do increase code size.  That just isn't important anymore.  Give me a 4k
>icache.  Thats usually 1k instructions, right?  I've been using the 
>Astronautics ZS-1, which almost always unrolls loops and does a very good
>job of picking the 'correct' unrolling depth.  Yet the loops I've looked
>at are almost never expanded to over 100 instructions, let alone 1k.
>
>When you are unrolling loops, you only need a certain number of instructions
>(dependant on fu and mem latencies) to work with.  Even a small icache,
>i.e. 1-4k, can hold the required number of instructions.
>
>Granted, this issue might still be important for modern cpus that still
>have very small icaches, but they are quickly being replaced.

I'll take a shot at this since I haven't seen anybody else try yet...

There ain't no such thing as a free lunch.

Unrolling a loop makes things faster by amortizing the per-trip
bookkeeping instructions (loop-test updates, adds to addresses)
and the conditional branch over more "useful" instructions.
But unrolling has overhead costs of its own due to I-cache
considerations, and these *can* make things slower -- or at least
change where the good unrolling limits lie.
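
The amortization argument can be put in numbers with a small model.
This is only a sketch: the function names, the instruction counts
(4 "useful" and 3 overhead instructions per iteration) and the rolled
cleanup loop for leftover iterations are all assumptions chosen for
illustration, not measurements from any real compiler or machine.

```python
def dynamic_instructions(n, body, overhead, unroll):
    """Instructions executed for n iterations of a loop unrolled `unroll` times.

    body     -- "useful" instructions per original iteration (assumed)
    overhead -- bookkeeping + branch instructions per trip around the loop (assumed)
    A rolled cleanup loop handles the n % unroll leftover iterations.
    """
    full_trips, leftover = divmod(n, unroll)
    unrolled = full_trips * (unroll * body + overhead)
    cleanup = leftover * (body + overhead)
    return unrolled + cleanup


def static_instructions(body, overhead, unroll):
    """Code size in instructions: the unrolled loop plus its cleanup loop."""
    size = unroll * body + overhead
    if unroll > 1:
        size += body + overhead  # the rolled cleanup loop
    return size


if __name__ == "__main__":
    # Executed instructions fall as the unroll factor rises, but code size grows.
    for u in (1, 2, 4, 8):
        print(u, dynamic_instructions(1000, 4, 3, u), static_instructions(4, 3, u))
```

In this model the executed-instruction count keeps falling with the
unroll factor while the code size keeps growing, and the model says
nothing at all about what it costs to get that code into the I-cache.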

Consider the following:

1)  It costs some overhead to get instructions into cache.
    The larger the code loop, the higher the pure overhead
    cost of fetching its instructions into the I-cache.
    As the processor becomes faster, the overhead per instruction
    almost certainly becomes relatively larger.
    The unrolled loop has to pay this direct overhead cost
    back before it even breaks even -- but the more a loop is
    unrolled, the fewer trips through it you are *guaranteed* to
    make, so there are fewer trips over which to recover the cost.
    For the fast processor (with the smaller cache?) it takes more
    loop traversals just to break even.
2)  A more-unrolled loop, being bigger, displaces more instructions
    from the I-cache than a less-unrolled loop.
    Some of those instructions were useful!
    That means that in addition to incurring a direct overhead cost
    by being fetched in the first place, an unrolled loop incurs an
    indirect cost by forcing the processor to re-fetch some
    instructions that would have stayed in cache otherwise.
    So the savings needed to break even go up still more.
3)  Additional "real" memory traffic is nearly always bad
    in some limit.  It makes easy-to-program machines like
    bus-based shared-memory machines poop out at a lower number
    of processors.
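
A rough break-even estimate for points 1 and 2 might look like the
following. Everything here is a simplifying assumption made for the
sake of the sketch: one cycle per executed instruction, a fixed
`fill_cycles` cost to bring each instruction into the I-cache, a
hypothetical `displaced` count of useful instructions that get evicted
and re-fetched later, and no cleanup loop.

```python
def break_even_iterations(body, overhead, unroll, fill_cycles, displaced=0):
    """Original-loop iterations needed before unrolling pays for itself.

    body        -- "useful" instructions per original iteration (assumed)
    overhead    -- bookkeeping + branch instructions per trip (assumed)
    fill_cycles -- assumed cycles to bring one instruction into the I-cache
    displaced   -- assumed useful instructions evicted and later re-fetched
    Assumes one cycle per executed instruction and no cleanup loop.
    """
    if unroll <= 1:
        return 0.0
    extra_code = (unroll - 1) * body              # instructions added by unrolling
    cost = (extra_code + displaced) * fill_cycles  # direct + indirect fetch cost
    saved_per_iteration = overhead * (1 - 1 / unroll)
    return cost / saved_per_iteration


if __name__ == "__main__":
    # Same loop, same unroll factor; only the relative fetch cost changes.
    for fill in (2, 10):
        print(fill, break_even_iterations(4, 3, 4, fill))
    # Adding displaced instructions pushes the break-even point out further.
    print(break_even_iterations(4, 3, 4, 10, displaced=12))
```

With these made-up numbers, raising the per-instruction fetch cost from
2 to 10 cycles raises the break-even point by the same factor of five,
and counting displaced instructions raises it again.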

Nothing seems to scale very well: loop unrolling that works
nicely on one machine may not be a win on something either 5 times
faster or 5 times slower.

Loop unrolling isn't necessarily as big a win
as it might seem for all situations.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650