Path: utzoo!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ames!mailrus!tut.cis.ohio-state.edu!ukma!rutgers!rochester!pt.cs.cmu.edu!sei!sei.cmu.edu!firth From: firth@sei.cmu.edu (Robert Firth) Newsgroups: comp.arch Subject: Procedure Call Protocol Message-ID: <7583@aw.sei.cmu.edu> Date: 2 Nov 88 19:57:14 GMT References: <3300037@m.cs.uiuc.edu> <5938@killer.DALLAS.TX.US> <7580@aw.sei.cmu.edu> Sender: netnews@sei.cmu.edu Reply-To: firth@bd.sei.cmu.edu (Robert Firth) Organization: Carnegie-Mellon University, SEI, Pgh, Pa Lines: 153 An Example Procedure Protocol ----------------------------- Readers of this newsgroup know that I'm rarely reticent about criticising other people's hardware and software. Here is your chance to do me the same service. The following is the procedure call, entry and exit protocol I devised to support traditional "Algol-like" languages on the MIPS R2000 hardware. These languages include Algol-60, BCPL, ISO Pascal, and Modula-2. From the point of view of the procedural interface, the main features are . procedures can recurse . all local objects and function results have static size . there are procedure variables and parametric procedures . procedures can access non-local variables . there are constructs that jump out of procedures Now, not all the above languages have all these features. So we must be prepared to implement all of them, but if possible optimise them away selectively. I give first the "canonical" form, and then discuss how much can be selectively removed. Local variables are allocated from a stack. The register RP is the stack pointer (actual register numbers don't matter), and at any time it designates the start of the stack frame of the current procedure. The stack grows UPWARDS, and it is an open stack, ie there is no stack front pointer and the CALLER moves RP so that it will be correct on entry to the callee. Parameters are passed in U0..U7, by value or reference as appropriate, upto a maximum of 8 words. After that, they are stored on the stack in the place where the callee expects to find them. Function results, if small, are returned in ; if large, they are returned in an object allocated by the caller and passed as an extra VAR parameter. Non-local objects are accessed by following a static chain, the head of which is passed by the caller. A procedure value in general consists of a pair (code address, static link), and that is how a parametric procedure is passed. A longjump is implemented by resetting RP and then taking the jump. The actual code of the BCPL "longjump" routine is move RP,U0 ; reset RP from first parameter j U1 ; jump to address passed as second parameter and this is generated inline for things like Pascal GOTO. Call Sequence ------------- ; before you get here, save all live registers lw U0, first_parameter lw U1, second_parameter ... (A1) lw RL, callee_address (A2) lw RE, static_link (A3) addi RP, frame_size ; raise stack pointer (A4) jal RL ; call procedure (A5) nop (A6) subi RP, frame_size ; lower stack pointer ; expect result in U0 Entry Sequence -------------- (B1) sw RL,0(RP) ; save return link (B2) sw RE,4(RP) ; save static link sw U0,8(RP) ; save parameter 1 ... Return Sequence --------------- ; get result into U0 (C1) lw RL,0(RP) ; restore return link (C2) nop (C3) j RL ; jump (C4) nop The cost of a basic procedure call, entry and exit is therefore 12 cycles, ignoring any parameters or result. Not bad, when you reflect that the Intel 80960 takes 15 cycles to do a good deal less (an unfair comparison, of course, since that machine is handicapped by having special call and return instructions). Let us now optimise this code. 1. How often do we need to compute the callee address? Only if it is a procedure variable or a parametric procedure. In the code I've seen, that is about 30% of the time, so deduct [0.7 * A1] 2. How often do we have to pass a static link? Never in BCPL or Modula-2, of course. But otherwise, only if the callee is a parametric procedure or an explicit inner procedure. That happens less than 25% of the time, so we can deduct [0.75 * A2]. (Happily, procedure variables never need a static link in any of the above languages that allow them.) 3. Similarly, how often do we have to STORE a static link that has been passed us by the caller? Only if we are, indeed, an inner procedure. That happens hardly ever - say 5% of the time - so deduct [0.95 * B2] (You must pass the link to a parametric procedure, since you don't know whether the actual is an inner. But the actual rarely needs to store the link, since it DOES know whether it's inner.) 4. Now, how often do we have to raise and lower the stack? Not once per call, clearly, because if we have two calls one after the other then the SUBI after the first cancels the ADDI before the second. To cut a long story short, we need ~ 45 moves in 100 calls, so deduct 1.10 instructions [0.55 * (A3+A6)] 5. What about that NOP after the call. How often can we fill that? In general, almost always. The stack raise can go there (it has no delay and so RP will be valid at point B1). The load of a parameter can go there, provided the callee is aware of the fact. Or an extraneous instruction can go there. That adds up to over 97% of the time, so I'm simply going to deduct A5 completely. 6. How often do we have to save and restore the return address? Whenever we are a non-leaf procedure. Dynamically, about 25% of calls are calls of leaf procedures, so deduct 0.5 [0.25 * (B1+C1)] 7. Finally, what about the two NOPs C2 and C4? It is almost always possible to fill the first, for instance by the load of the result or the other last instruction of the routine. The second is much harder. I do not allow the load of the result to go there, since then U0 would be invalid immediately upon return. The actual figure is under 10%, so I shall be pessimistic here and say that exactly one of the NOP instructions can be removed. Adding that all up, I find that, of a canonical 12 cycles, near enough half can be removed by straightforward local optimisation. The result is a protocol able to accommodate the languages given above, at an average cost of SIX cycles per procedure call. I'll offer two reasons for this result (a) a very clean, simple, and efficient basic machine design (thanks, MIPS) (b) a protocol that has taken care to partition the work so that each piece is done by the party most able to optimise it. This second reason is why I believe a special hardware instruction is a bad idea. Such an instruction must almost always implement almost all of the protocol. But if half the protocol can be optimised away, then the instructoon, to win, must execute more than twice as fast as the equivalent naive instructions. I don't believe that can be done. Thank you for your attention.