Path: utzoo!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ames!mailrus!tut.cis.ohio-state.edu!ukma!rutgers!rochester!pt.cs.cmu.edu!sei!sei.cmu.edu!firth
From: firth@sei.cmu.edu (Robert Firth)
Newsgroups: comp.arch
Subject: Procedure Call Protocol
Message-ID: <7583@aw.sei.cmu.edu>
Date: 2 Nov 88 19:57:14 GMT
References: <3300037@m.cs.uiuc.edu> <5938@killer.DALLAS.TX.US> <7580@aw.sei.cmu.edu>
Sender: netnews@sei.cmu.edu
Reply-To: firth@bd.sei.cmu.edu (Robert Firth)
Organization: Carnegie-Mellon University, SEI, Pgh, Pa
Lines: 153


	An Example Procedure Protocol
	-----------------------------

Readers of this newsgroup know that I'm rarely reticent about
criticising other people's hardware and software.  Here is your
chance to do me the same service.

The following is the procedure call, entry and exit protocol I devised
to support traditional "Algol-like" languages on the MIPS R2000 hardware.
These languages include Algol-60, BCPL, ISO Pascal, and Modula-2. From
the point of view of the procedural interface, the main features are

. procedures can recurse

. all local objects and function results have static size

. there are procedure variables and parametric procedures

. procedures can access non-local variables

. there are constructs that jump out of procedures

Now, not all the above languages have all these features.  So we must be
prepared to implement all of them, but if possible optimise them away
selectively.  I give first the "canonical" form, and then discuss how
much can be selectively removed.

Local variables are allocated from a stack.  The register RP is the
stack pointer (actual register numbers don't matter), and at any
time it designates the start of the stack frame of the current procedure.
The stack grows UPWARDS, and it is an open stack, ie there is no stack
front pointer and the CALLER moves RP so that it will be correct on
entry to the callee.

Parameters are passed in U0..U7, by value or reference as appropriate,
upto a maximum of 8 words.  After that, they are stored on the stack in
the place where the callee expects to find them.  Function results, if
small, are returned in <U0,U1>; if large, they are returned in an object
allocated by the caller and passed as an extra VAR parameter.

Non-local objects are accessed by following a static chain, the head of
which is passed by the caller.  A procedure value in general consists of
a pair (code address, static link), and that is how a parametric procedure
is passed.

A longjump is implemented by resetting RP and then taking the jump.  The
actual code of the BCPL "longjump" routine is

	move RP,U0	; reset RP from first parameter
	j U1		; jump to address passed as second parameter

and this is generated inline for things like Pascal GOTO.

Call Sequence
-------------

	; before you get here, save all live registers
	lw U0, first_parameter
	lw U1, second_parameter
	...
(A1)	lw RL, callee_address
(A2)	lw RE, static_link
(A3)	addi RP, frame_size	; raise stack pointer
(A4)	jal RL			; call procedure
(A5)	nop
(A6)	subi RP, frame_size	; lower stack pointer
	; expect result in U0

Entry Sequence
--------------

(B1)	sw RL,0(RP)		; save return link
(B2)	sw RE,4(RP)		; save static link
	sw U0,8(RP)		; save parameter 1
	...

Return Sequence
---------------

	; get result into U0
(C1)	lw RL,0(RP)		; restore return link
(C2)	nop
(C3)	j RL			; jump
(C4)	nop

The cost of a basic procedure call, entry and exit is therefore 12 cycles,
ignoring any parameters or result.  Not bad, when you reflect that the
Intel 80960 takes 15 cycles to do a good deal less (an unfair comparison,
of course, since that machine is handicapped by having special call and
return instructions).

Let us now optimise this code.

1. How often do we need to compute the callee address?  Only if it is
   a procedure variable or a parametric procedure.  In the code I've seen,
   that is about 30% of the time, so deduct [0.7 * A1]

2. How often do we have to pass a static link?  Never in BCPL or Modula-2,
   of course.  But otherwise, only if the callee is a parametric procedure
   or an explicit inner procedure.  That happens less than 25% of the time,
   so we can deduct [0.75 * A2].  (Happily, procedure variables never need
   a static link in any of the above languages that allow them.)

3. Similarly, how often do we have to STORE a static link that has been
   passed us by the caller?  Only if we are, indeed, an inner procedure.
   That happens hardly ever - say 5% of the time - so deduct [0.95 * B2]
   (You must pass the link to a parametric procedure, since you don't
   know whether the actual is an inner.  But the actual rarely needs to
   store the link, since it DOES know whether it's inner.)

4. Now, how often do we have to raise and lower the stack?  Not once per
   call, clearly, because if we have two calls one after the other then
   the SUBI after the first cancels the ADDI before the second.  To cut
   a long story short, we need ~ 45 moves in 100 calls, so deduct 1.10
   instructions [0.55 * (A3+A6)]

5. What about that NOP after the call.  How often can we fill that?
   In general, almost always. The stack raise can go there (it has
   no delay and so RP will be valid at point B1).  The load of a
   parameter can go there, provided the callee is aware of the fact.
   Or an extraneous instruction can go there.  That adds up to over
   97% of the time, so I'm simply going to deduct A5 completely.

6. How often do we have to save and restore the return address?  Whenever
   we are a non-leaf procedure.  Dynamically, about 25% of calls are
   calls of leaf procedures, so deduct 0.5 [0.25 * (B1+C1)]

7. Finally, what about the two NOPs C2 and C4?  It is almost always possible
   to fill the first, for instance by the load of the result or the other
   last instruction of the routine.  The second is much harder.  I do not
   allow the load of the result to go there, since then U0 would be invalid
   immediately upon return.  The actual figure is under 10%, so I shall
   be pessimistic here and say that exactly one of the NOP instructions
   can be removed.

Adding that all up, I find that, of a canonical 12 cycles, near enough
half can be removed by straightforward local optimisation.  The
result is a protocol able to accommodate the languages given above,
at an average cost of SIX cycles per procedure call.  I'll offer two
reasons for this result

(a) a very clean, simple, and efficient basic machine design (thanks, MIPS)

(b) a protocol that has taken care to partition the work so that
    each piece is done by the party most able to optimise it.

This second reason is why I believe a special hardware instruction is
a bad idea.  Such an instruction must almost always implement almost
all of the protocol.  But if half the protocol can be optimised away,
then the instructoon, to win, must execute more than twice as fast as
the equivalent naive instructions.  I don't believe that can be done.

Thank you for your attention.