Xref: utzoo comp.lang.c:26952 comp.lang.misc:4474
Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!cs.utexas.edu!yale!cmcl2!lanl!lambda!jlg
From: jlg@lambda.UUCP (Jim Giles)
Newsgroups: comp.lang.c,comp.lang.misc
Subject: Re: function calls
Message-ID: <14271@lambda.UUCP>
Date: 15 Mar 90 23:45:56 GMT
References: <23113@mimsy.umd.edu>
Lines: 51

From article <23113@mimsy.umd.edu>, by chris@mimsy.umd.edu (Chris Torek):
> [... 'caller-save' vs. 'callee-save' registers ...]
> As shown above, this is no longer true.  If the leaf uses a `large'
> number of registers (more than are available as temporary computation
> registers in non-leaf routines), this statement holds; if not, the
> fact that the routine is a leaf makes the registers `free'.
> 
> (Of course, callers that use lots of registers, and store things in
> the temporary registers, must spill those registers on subroutine
> calls.  This may be what Jim Giles meant all along.  Perhaps someone
> at MIPS can post statistics as to how often this is the case.)

This is exactly what I meant.  The Cray system has a similar mechanism
(in fact, it even has special types of 'leaf' procedures called
'baselevel' routines).  The problem is that the 'caller' routine still
needs to save 'live' values around calls because the registers assigned
to the 'callee' are nearly always in use.

When I write in assembly, I tend to use all the registers I can in order
to avoid the memory overhead - memory costs about a dozen clocks per
reference, while a transfer to the temp regs costs only one.  Even with
memory pipelined and running in parallel with other functional units,
this extra delay is expensive.  If I were writing a compiler, I would
be similarly greedy with the registers for generated code.
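
To put rough numbers on that, here is a back-of-the-envelope sketch.
The function name is made up for illustration, the figures of 12 and 1
clocks are assumptions read off the 'about a dozen' estimate above, and
any overlap from pipelining is ignored:

```c
/* Hypothetical cost model: clocks needed to park n live values and
   bring them back, given the cost of a single transfer.
   Ignores any overlap gained from pipelining. */
int save_restore_clocks(int n_values, int clocks_per_transfer)
{
    /* each value is moved twice: one save, one restore */
    return 2 * n_values * clocks_per_transfer;
}
```

With 8 live values, save_restore_clocks(8, 12) gives 192 clocks through
memory, against save_restore_clocks(8, 1) = 16 through the temp regs.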

All this trouble could be avoided if the register use of the 'callee'
were known in advance.  Then the code generator for the 'caller' could
do register scheduling with this extra information in mind.  This still
causes problems if the 'callee' uses a _lot_ of registers, but it's
better than nothing.  Of course, the best deal (if speed were
paramount) would be to 'inline' the 'callee' completely.  Then the
register scheduling would take place across the call boundary (and the
save/restore could be hidden better under pipelining).
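
A minimal sketch of the idea, assuming register sets are kept as bit
masks (the type and function names are hypothetical, not any real
compiler's interface): the caller needs to save only those live
registers that the callee is known to touch.

```c
/* Hypothetical representation: one bit per register, bit i = register i. */
typedef unsigned int regset;

/* Registers the caller must save around a call: those that hold live
   values AND that the callee may overwrite. */
regset must_save(regset live_in_caller, regset used_by_callee)
{
    return live_in_caller & used_by_callee;
}

/* Count the registers in a set (clears the lowest set bit each pass). */
int count_regs(regset s)
{
    int n = 0;
    while (s) {
        s &= s - 1;
        n++;
    }
    return n;
}
```

With r0-r7 live in the caller (mask 0x00FF) on a 16-register machine,
a callee of unknown register use has to be treated as mask 0xFFFF, so
all 8 live registers are saved; a callee known to use only r0 and r1
(mask 0x0003) leaves just 2 to save.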

> [...]
> The only problem with this last statement (`interprocedural analysis
> cannot be done due to separate compilation') is that someone already
> does it---namely, MIPS do it at the highest compilation level.  Again,
> one does it by cheating: `compile' to `Ucode', an intermediate level
> code that is neither source nor object.  When `linking' the Ucode,
> put everything together, do global flow analysis, optimise, and then
> generate machine code.

I've often thought that code generation should be done by the loader for
this very reason.  Both inlining and register scheduling across calls
would be improvements that would be worth the loader slowdown.  In
addition, the compile step would be considerably faster.  This means
that syntax checking would be a breeze (a common use of the compiler,
like it or not, is as a form of 'lint' - at least for non-C languages).

J. Giles