Xref: utzoo comp.lang.c:26926 comp.lang.misc:4460
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!aplcen!haven!mimsy!chris
From: chris@mimsy.umd.edu (Chris Torek)
Newsgroups: comp.lang.c,comp.lang.misc
Subject: Re: function calls
Message-ID: <23113@mimsy.umd.edu>
Date: 15 Mar 90 08:23:49 GMT
References: <1990Mar14.172152.25436@utzoo.uucp> <14268@lambda.UUCP>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 81

In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:
>
>1) Save/restore all register values that are still 'live' values for the
>   caller.

Right.

>2) Test the stack to make sure there is enough room for the callee's local
>   variables (and call the memory manager for a new stack frame if there
>   wasn't room).

True, but (as Jim Giles points out) VM machines can cheat.  (He does not
say how to cheat, but it is very easy: do not test.  Simply try to use the
new stack address(es).  If they are invalid, the kernel takes an access
fault and decides whether to grow the stack or to terminate the program.)

>3) Create (on the stack) a traceback entry so the debugger and/or the
>   postmortem analyzer can find the current thread through the call tree.

There is a way to cheat on this as well; MIPS use it in their compilers.
When a function does not need a frame pointer---most do not---it gets
instead a `virtual frame pointer' which is simply an offset from the
stack pointer.  The debugger reads the symbol table of the object file
to find out which register is the stack pointer, and what the offset from
this register is, so as to find the previous call (and any saved registers,
using a compiler-generated field also found in the object file).

>The problem is: _MOST_ of the procedure call overhead is concentrated in
>number (1)!  And, this overhead applies to all procedures - including
>'leaf' routines.

True; and this is where most of the effort at reducing function call
overhead must be applied.  Fortunately, some smart guys have figured
out some good methods to handle this.  On the MIPS, for instance, the
system designates some registers as `caller saves' and some registers
as `callee saves'.  A leaf procedure heads straight for the `caller
saves' registers; others head for the `callee saves' registers.

In other words, leaf routines can (provided they do not need a huge
number of registers) get away with saving and restoring nothing.  (They
need not even save and restore the stack pointer, if a `callee saves'
register is free: simply do something like

	add	#-40,standard_sp,temp_reg_3 // set temp_reg to sp-40

and then use temp_reg_3 as your `stack pointer'.

>Basically, 'leafness' has very little to do with procedure call overhead.

As shown above, this is no longer true.  If the leaf uses a `large'
number of registers (more than are available as temporary computation
registers in non-leaf routines), this statement holds; if not, the
fact that the routine is a leaf makes the registers `free'.

(Of course, callers that use lots of registers, and store things in
the temporary registers, must spill those registers on subroutine
calls.  This may be what Jim Giles meant all along.  Perhaps someone
at MIPS can post statistics as to how often this is the case.)

>Because modern machines tend to have a _large_ number of registers,
>implementing (1) is usually quite expensive - and it can't be made
>cheaper without interprocedural analysis which in turn can't be done as
>long as separate compilation is possible.

The only problem with this last statement (`interprocedural analysis
cannot be done due to separate compilation') is that someone already
does it---namely, MIPS do it at the highest compilation level.  Again,
one does it by cheating: `compile' to `Ucode', an intermediate level
code that is neither source nor object.  When `linking' the Ucode,
put everything together, do global flow analysis, optimise, and then
generate machine code.

(Of course, linking starts to get very slow, encouraging people to
run with lower optimisation levels.  This is more or less as it should
be, though: you choose the trade-off of compilation time for run time.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris