Path: utzoo!utgpu!attcan!uunet!seismo!sundc!pitstop!sun!decwrl!purdue!i.cc.purdue.edu!k.cc.purdue.edu!l.cc.purdue.edu!cik
From: cik@l.cc.purdue.edu (Herman Rubin)
Newsgroups: comp.arch
Subject: Re: register save/restore
Message-ID: <1009@l.cc.purdue.edu>
Date: 3 Nov 88 16:00:49 GMT
References: <3300037@m.cs.uiuc.edu> <5938@killer.DALLAS.TX.US> <7580@aw.sei.cmu.edu>
Organization: Purdue University Statistics Department
Lines: 183

In article <7580@aw.sei.cmu.edu>, firth@sei.cmu.edu (Robert Firth) writes:
> 
> 	Register Saving across Procedure Calls
> 	--------------------------------------
> 
> Which is better - caller saves or callee saves?
> 
> 
> A. Is this the right question?

It is, if library subroutines are to be used.  A compiler cannot change
the procedure to be followed in this case.

			..................

| B. Are the strategies equally sound?
| 
| The point I consider most important, is that there is a definite
| semantic asymmetry between the two strategies.  If the caller saves,
| then the caller is saving, locally, his own local state.  This seems
| to me basically correct.  If the callee saves, then the callee is
| saving, local to him, state that belongs to someone else.  Moreover,
| he is saving state of greater extent - the caller's registers - in
| space of lesser extent - his own stack frame.  This seems to me
| semantically unsound.
| 
| Now, I tend to let few things get in the way of efficiency, especially
| efficiency of something as crucial as the procedure call, but semantic
| correctness is one of those things.  So in this case, I'm going to
| come out and say that "callee saves" is fundamentally wrong, and should
| be avoided if possible, even at some cost.  
| 
| 
| C. Which is more efficient?
| 
| Happily, however, the efficiency arguments, in my experience, support
| the "caller saves" strategy, so one can indeed do well by doing good.
| 
| The most blatant case is that of the longjump, which appears in other
| languages as a GOTO or RAISE statement.  This causes a jump out of a
| procedure to somewhere further up the call chain, and so must reset
| the environment of the destination.  If the caller saves state, then
| this is simple: the jump is a jump, and the destination knows where
| all the state has been saved.  In most implementations, one need only
| reset the frame pointer to the current incarnation of the destination
| procedure, and take the jump to the label.
| 
| But if the callee saves, then the caller has no idea how to recover
| his saved state, which may be buried any number of stack frames further
| down.  It is therefore necessary to unwind the entire stack before taking
| the jump.  The difference in cost can easily be a factor of 100 or more.

 The only problem is in a dropback over several calls; it would then be
necessary to have the callee place the information about what was saved and
where so the caller could find it quickly.  Possibly this should be use
instead of an automatic restore on return.

> I do not regard this as a marginal point.  The exception is beginning to
> be used as a normal programming tool; it is a feature of several modern
> languages and will probably be a feature of most new ones.  Its efficient
> implementation is as desirable as the efficient implementation of, say,
> for-loops or array assignments.
> 
> Turning though to the main topic - which is the faster strategy for a
> normal call and return - I see two issues here: the number of registers
> to be saved and restored, and the cost of each save and restore.
> 
> 
> D. Some facts and guesses
> 
> In my experience, there is almost no difference between the number
> of registers used by the caller and the number used by the callee.
> 
| Small procedures tend to use fewer than less small, and leaf procedures
| tend to be a bit smaller, so on balance it seems marginally better for
| the callee to save. (What this also tells us is that interprocedural
| optimisation of leaves and leaf-callers only will give you big returns)
| 
| But this is outweighed by two factors
| 
| * The callee must save all registers it will use throughout the body;
|   the caller need save only the registers that are live at the point of
|   call.
| 
| * When two or more calls occur in succession, both callees must save,
|   but the caller need save only once.
| 
| Rough guesses I have accumulated over time are
| 
> * at any call point, caller is using ~ 2/3 of the registers it will
>   use at all (though this is partly due to defects in register
>   allocation strategies)
> 
> * on average, a procedure call is (almost) immediately followed by
>   another about 2/3 of the time. This implies that if the caller
>   saves, it will have to save ~ 45 times for every 100 calls.
> 
> These two factors together imply that the cost of a caller-saves protocol
> is about 1/3 that of a callee-saves protocol.  (Do you believe that?)
> 
No.  My experience is quite different.  The cases where I do this are because
the callee does not save; if I have a conditional subroutine call (not that
uncommon) it may be necessary to do a save before the call--the save is the
second call.

> Now consider the cost of a save and restore.  There are two factors that
> make it cheaper for the caller to save
> 
> * the register may be slaving a known value.  The caller then need not
>   save at all, merely restore.  I find this is true of at least 20% of
>   live registers. (Consider for instance the MC68020.  If you are working
>   hard, you probably have 5 or 6 live D registers, of which at least one
>   is holding a constant, and 4 or 5 live A registers, of which perhaps
>   two are holding pointers to data structures.  That's 3 out of 10)
> 
> * the store may be combined with another operation.  For example, the
>   last operation on the register may have been an add
> 
>     ADDL2 X,R1
> 
>   This can be changed into
> 
>     ADDL3 X,R1,save_place
> 
>   A small saving, admittedly, but a saving.
> 
> There is a third factor I have not assessed satisfactorily, which applies
> especially to RISC machines.  Is the caller or the callee better able to
> distribute loads and stores through the code, so as to overlap any load
> or store delays?  I suspect it is the callee, but that is a hunch.

This assumes that there are only a few registers.  Try this on a machine
with many registers, such as the CYBER205 with 256 registers.  And consider
the problem on a machine with vector registers.  In most cases, most of the
vector registers will be in use; I believe they all do not have nearly 
enough.  This is likely to be at least 4096 bytes; the number of words 
depends on the word length.  The problem is that any convention may or
may not be right for the given application.

> 
> E. What about Hardware Help?
> 
> On the question of a "high-level procedure call" implemented in hardware,
> such as the VAX CALLS, I think my opinion is known: such instructions
> are worthless.  But what about a simpler instruction, such as a hardware
> register save, or save-under-mask? 

The VAX has register save (and restore) under mask.

			....................

Probably the best help that hardware can give is to have a "dirty" bit
or a mask on call as to what it may be necessary to save.  The bit is
cleared on saving and reset on restoring.  If a mask is used, the subroutine
would remove the bits corresponding to saved registers to recompute its
mask.  The problem with a mask is that if there are a large number of 
registers, the mask is long.

It seems from the references above, that the correspondent is ignoring the
saving of flaoting-point registers where they are separate.  They fall into
the same category.  It is much less likely that floating point registers
will be of the restore only type than pointers.
> 
> F. Possibly Offensive Remark
> 
> I agree with Henry, that procedure calling protocols need to be thought
> afresh for each fresh machine.  Unfortunately, very few compiler shops
> seem prepared to do this.  I see over and again the model of a closed,
> downward-growing stack; caller pushes parameters; callee saves registers
> and allocates local space.  One can do better than this by a factor of
> two or three, on almost any machine I know, with a different model more
> amenable to local optimisation.

			......................

It is necessary to consider the problem for each machine.  And with planned
exceptions, and especially if an interrupt (and we should also have programmed
interrupts on conditions, instead of having to test at each occasion), I see
no good alternative to having the callee save.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)