Path: utzoo!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ames!nrl-cmf!cmcl2!rutgers!rochester!pt.cs.cmu.edu!sei!sei.cmu.edu!firth
From: firth@sei.cmu.edu (Robert Firth)
Newsgroups: comp.arch
Subject: Re: register save/restore
Message-ID: <7580@aw.sei.cmu.edu>
Date: 2 Nov 88 18:14:29 GMT
References: <3300037@m.cs.uiuc.edu> <5938@killer.DALLAS.TX.US>
Sender: netnews@sei.cmu.edu
Reply-To: firth@bd.sei.cmu.edu (Robert Firth)
Organization: Carnegie-Mellon University, SEI, Pgh, Pa
Lines: 156


	Register Saving across Procedure Calls
	--------------------------------------

Which is better - caller saves or callee saves?


A. Is this the right question?

First, and most important, if you are designing a professional-quality
production compiler, this is the wrong question.  Such a compiler must
perform interprocedural optimisation if it is to be respectably state
of the art.  

However, if you want to design a prototype, amateur, or deliberately
low-cost compiler, the issue is probably one worth considering.  To
keep this note short, I'm going to assume you understand the basic
issue and are familiar with current hardware and software technology.


B. Are the strategies equally sound?

The point I consider most important, is that there is a definite
semantic asymmetry between the two strategies.  If the caller saves,
then the caller is saving, locally, his own local state.  This seems
to me basically correct.  If the callee saves, then the callee is
saving, local to him, state that belongs to someone else.  Moreover,
he is saving state of greater extent - the caller's registers - in
space of lesser extent - his own stack frame.  This seems to me
semantically unsound.

Now, I tend to let few things get in the way of efficiency, especially
efficiency of something as crucial as the procedure call, but semantic
correctness is one of those things.  So in this case, I'm going to
come out and say that "callee saves" is fundamentally wrong, and should
be avoided if possible, even at some cost.  


C. Which is more efficient?

Happily, however, the efficiency arguments, in my experience, support
the "caller saves" strategy, so one can indeed do well by doing good.

The most blatant case is that of the longjump, which appears in other
languages as a GOTO or RAISE statement.  This causes a jump out of a
procedure to somewhere further up the call chain, and so must reset
the environment of the destination.  If the caller saves state, then
this is simple: the jump is a jump, and the destination knows where
all the state has been saved.  In most implementations, one need only
reset the frame pointer to the current incarnation of the destination
procedure, and take the jump to the label.

But if the callee saves, then the caller has no idea how to recover
his saved state, which may be buried any number of stack frames further
down.  It is therefore necessary to unwind the entire stack before taking
the jump.  The difference in cost can easily be a factor of 100 or more.

I do not regard this as a marginal point.  The exception is beginning to
be used as a normal programming tool; it is a feature of several modern
languages and will probably be a feature of most new ones.  Its efficient
implementation is as desirable as the efficient implementation of, say,
for-loops or array assignments.

Turning though to the main topic - which is the faster strategy for a
normal call and return - I see two issues here: the number of registers
to be saved and restored, and the cost of each save and restore.


D. Some facts and guesses

In my experience, there is almost no difference between the number
of registers used by the caller and the number used by the callee.

Small procedures tend to use fewer than less small, and leaf procedures
tend to be a bit smaller, so on balance it seems marginally better for
the callee to save. (What this also tells us is that interprocedural
optimisation of leaves and leaf-callers only will give you big returns)

But this is outweighed by two factors

* The callee must save all registers it will use throughout the body;
  the caller need save only the registers that are live at the point of
  call.

* When two or more calls occur in succession, both callees must save,
  but the caller need save only once.

Rough guesses I have accumulated over time are

* at any call point, caller is using ~ 2/3 of the registers it will
  use at all (though this is partly due to defects in register
  allocation strategies)

* on average, a procedure call is (almost) immediately followed by
  another about 2/3 of the time. This implies that if the caller
  saves, it will have to save ~ 45 times for every 100 calls.

These two factors together imply that the cost of a caller-saves protocol
is about 1/3 that of a callee-saves protocol.  (Do you believe that?)

Now consider the cost of a save and restore.  There are two factors that
make it cheaper for the caller to save

* the register may be slaving a known value.  The caller then need not
  save at all, merely restore.  I find this is true of at least 20% of
  live registers. (Consider for instance the MC68020.  If you are working
  hard, you probably have 5 or 6 live D registers, of which at least one
  is holding a constant, and 4 or 5 live A registers, of which perhaps
  two are holding pointers to data structures.  That's 3 out of 10)

* the store may be combined with another operation.  For example, the
  last operation on the register may have been an add

    ADDL2 X,R1

  This can be changed into

    ADDL3 X,R1,save_place

  A small saving, admittedly, but a saving.

There is a third factor I have not assessed satisfactorily, which applies
especially to RISC machines.  Is the caller or the callee better able to
distribute loads and stores through the code, so as to overlap any load
or store delays?  I suspect it is the callee, but that is a hunch.


E. What about Hardware Help?

On the question of a "high-level procedure call" implemented in hardware,
such as the VAX CALLS, I think my opinion is known: such instructions
are worthless.  But what about a simpler instruction, such as a hardware
register save, or save-under-mask?  A good example is provided by the
MOVEM instruction of the MC68020, or the LDM/STM of the PE3200.  These
are typical, and typically their break-even point is at about 4
registers.  This suggests to me that they are or marginal value - after
all, if you are already passing parameters in registers, how many are
left to save around the typical call?

The instructions are probably still worth having for things like a task
context save (if only the PE3200 permitted them to be used that way!),
but their contribution to a procedure protocol is slight.


F. Possibly Offensive Remark

I agree with Henry, that procedure calling protocols need to be thought
afresh for each fresh machine.  Unfortunately, very few compiler shops
seem prepared to do this.  I see over and again the model of a closed,
downward-growing stack; caller pushes parameters; callee saves registers
and allocates local space.  One can do better than this by a factor of
two or three, on almost any machine I know, with a different model more
amenable to local optimisation.

We are used to making up in software for badly designed hardware.  Now
we have the reverse: making up in hardware for badly designed software.
At least, that is my explanation for register windows.