Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!mstan!amull
From: amull@Morgan.COM (Andrew P. Mullhaupt)
Newsgroups: comp.arch
Subject: Re: Inlining subroutines at link time
Message-ID: <1185@s8.Morgan.COM>
Date: 5 Jul 90 15:13:08 GMT
References: <9629@brazos.Rice.edu>
Organization: Morgan Stanley & Co. NY, NY
Lines: 46

In article <9629@brazos.Rice.edu>, preston@titan.rice.edu (Preston Briggs) writes:
> Replying to many people at once...
> 
> Riordan asked opinions on inlining at link time.
> Since it's at link time, I assume (!) the linker simply substitutes
> the procedure body (minus the return) for the call instruction.
> Hence, 2 instructions saved.  Arguments would still be passed
> in registers or on the stack and registers would be saved and restored
> as usual.  More complex schemes would (possibly) be too complex and
> expensive for link-time.  At any rate, that's all that was promised
> by his manual.

Well, if you're not going to eliminate the argument passing, then the
thing is less useful, but I don't see why in FORTRAN, where you know how
your arguments look on the stack, the linker couldn't easily do a lot
more for you than just save the call and return.

> 
> Paging effects and I-cache effects are difficult to determine in
> advance.  You get better locality with inlining, but you also get
> bulkier code.  Consider also the effect of inlining a very small routine
> (a favorite choice).  With the non-inlined version, the routine might
> remain in cache (or paged in) for extended periods, perhaps being
> called from many different call sites.  In the inlined case,
> each call site would have its own copy and require more cache
> lines or pages.

It isn't this simple at all. If the program is running on a multitasking
system, your routine might get knocked out of the cache by part of the
OS, in which case the very small routine has to be pulled back in as a
bunch of cache misses. Contrast this with the inlined version, where the
routine occurs at many different addresses, but almost always at the
addresses immediately following what you are executing. In this case,
the part of the OS which stomps the out-of-line routine can only
sometimes get you, so you see better overall performance. Now ask
yourself: what's the one behavior of programs which I-cache designers
_must_ expect? Sure, it's the one where you do an instruction, and then
the next one, and then the one after that... and so your best hope of
good performance across more than one machine is to inline small
routines. Now, many inlining techniques actually make more informed
decisions about what to inline and what not, but in the zeroth-order
case (inline all calls to 'X' or none of 'em) the simple answer is:
inline 'X' if it is smaller than some size, and not if it's bigger.
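
For concreteness, the zeroth-order rule might be sketched in C roughly as
follows. This is only an illustration: the name should_inline and the
32-byte cutoff are made up here, and a real linker or compiler would pick
its own limit and its own measure of size.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cutoff in bytes of machine code. No particular value is
   implied by the discussion; 32 is just a placeholder. */
#define INLINE_SIZE_LIMIT ((size_t)32)

/* Zeroth-order rule: one decision per routine, applied to every call
   site alike. Inline the routine if its body is smaller than the limit,
   otherwise leave every call as an ordinary call. */
static bool should_inline(size_t body_size, size_t limit)
{
    return body_size < limit;
}
```

With this placeholder limit, a 16-byte routine would be inlined at every
call site and a 200-byte routine at none of them.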

Later,
Andrew Mullhaupt