Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!mstan!amull
From: amull@Morgan.COM (Andrew P. Mullhaupt)
Newsgroups: comp.arch
Subject: Re: Inlining subroutines at link time
Message-ID: <1185@s8.Morgan.COM>
Date: 5 Jul 90 15:13:08 GMT
References: <9629@brazos.Rice.edu>
Organization: Morgan Stanley & Co. NY, NY
Lines: 46

In article <9629@brazos.Rice.edu>, preston@titan.rice.edu (Preston Briggs) writes:
> Replying to many people at once...
>
> Riordan asked opinions on inlining at link time.
> Since it's at link time, I assume (!) the linker simply substitutes
> the procedure body (minus the return) for the call instruction.
> Hence, 2 instructions saved. Arguments would still be passed
> in registers or on the stack and registers would be saved and restored
> as usual. More complex schemes would (possibly) be too complex and
> expensive for link-time. At any rate, that's all that was promised
> by his manual.

Well, if you're not going to save the argument passing, then the thing
is less useful, but I don't see why in FORTRAN, where you know how your
arguments look on the stack, the linker couldn't easily do a lot more
for you than just save the call and return.

>
> Paging effects and I-cache effects are difficult to determine in
> advance. You get better locality with inlining, but you also get
> bulkier code. Consider also the effect of inlining a very small routine
> (a favorite choice). With the non-inlined version, the routine might
> remain in cache (or paged in) for extended periods, perhaps being
> called from many different call sites. In the inlined case,
> each call site would have its own copy and require more cache
> lines or pages.

It isn't this simple at all. If the program is running on a multitasking
system, your routine might get evicted from the cache by part of the OS,
in which case the very small routine always gets pulled back in as a
bunch of cache misses.
Contrast this with the inlined version, where the routine occurs at many
different addresses, but almost always at the addresses immediately
following what you are executing. In this case, the part of the OS which
stomps the out-of-line routine can only sometimes get you, so you see
better overall performance.

Now ask yourself: what's the one behavior of programs which I-cache
designers _must_ expect? Sure, it's the one where you do an instruction,
and then the next one, and then the one after that... and so your best
hope of good performance across more than one machine is to inline
small routines.

Now, many inlining techniques actually make more informed decisions
about what to inline and what not, but in the zeroth-order case (inline
all calls to 'X' or none of 'em) the simple answer is: inline 'X' if it
is smaller than some size, and not if it's bigger.

Later,
Andrew Mullhaupt
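
To make the zeroth-order rule above concrete, here is a minimal sketch in
C. Nothing in it comes from a real linker: the struct, the function
names, and the 32-byte cutoff are all hypothetical, chosen only for
illustration. The second function puts a rough number on the "bulkier
code" trade-off from the quoted article, under the simplifying
assumption that an inlined call site costs exactly one copy of the body
and an out-of-line call site costs a fixed number of call bytes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of a routine as a link-time inliner might see it. */
struct routine {
    const char *name;
    size_t      size_bytes;   /* size of the routine body */
};

/* Assumed cutoff; the post only says "smaller than some size". */
#define INLINE_THRESHOLD_BYTES 32

/* Zeroth-order rule: inline every call to the routine, or none,
 * based purely on its size. */
bool should_inline(const struct routine *r)
{
    return r->size_bytes < INLINE_THRESHOLD_BYTES;
}

/* Rough total code size for one routine and its call sites.
 * Inlined:     one copy of the body per call site.
 * Out of line: one shared body plus call_bytes at each call site.
 * This is a simplified model; it ignores argument passing and
 * register save/restore. */
size_t code_bytes(const struct routine *r, size_t call_sites,
                  size_t call_bytes, bool inlined)
{
    if (inlined)
        return call_sites * r->size_bytes;
    return r->size_bytes + call_sites * call_bytes;
}
```

With a 12-byte routine, 10 call sites, and 4-byte calls, this model gives
120 bytes inlined against 52 bytes out of line. The inlined copies cost
more space, but each one sits directly in the sequential instruction
stream, which is the access pattern the post argues I-caches must
handle well.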