Path: utzoo!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ames!nrl-cmf!cmcl2!rutgers!rochester!pt.cs.cmu.edu!sei!sei.cmu.edu!firth From: firth@sei.cmu.edu (Robert Firth) Newsgroups: comp.arch Subject: Re: register save/restore Message-ID: <7580@aw.sei.cmu.edu> Date: 2 Nov 88 18:14:29 GMT References: <3300037@m.cs.uiuc.edu> <5938@killer.DALLAS.TX.US> Sender: netnews@sei.cmu.edu Reply-To: firth@bd.sei.cmu.edu (Robert Firth) Organization: Carnegie-Mellon University, SEI, Pgh, Pa Lines: 156 Register Saving across Procedure Calls -------------------------------------- Which is better - caller saves or callee saves? A. Is this the right question? First, and most important, if you are designing a professional-quality production compiler, this is the wrong question. Such a compiler must perform interprocedural optimisation if it is to be respectably state of the art. However, if you want to design a prototype, amateur, or deliberately low-cost compiler, the issue is probably one worth considering. To keep this note short, I'm going to assume you understand the basic issue and are familiar with current hardware and software technology. B. Are the strategies equally sound? The point I consider most important, is that there is a definite semantic asymmetry between the two strategies. If the caller saves, then the caller is saving, locally, his own local state. This seems to me basically correct. If the callee saves, then the callee is saving, local to him, state that belongs to someone else. Moreover, he is saving state of greater extent - the caller's registers - in space of lesser extent - his own stack frame. This seems to me semantically unsound. Now, I tend to let few things get in the way of efficiency, especially efficiency of something as crucial as the procedure call, but semantic correctness is one of those things. So in this case, I'm going to come out and say that "callee saves" is fundamentally wrong, and should be avoided if possible, even at some cost. C. Which is more efficient? Happily, however, the efficiency arguments, in my experience, support the "caller saves" strategy, so one can indeed do well by doing good. The most blatant case is that of the longjump, which appears in other languages as a GOTO or RAISE statement. This causes a jump out of a procedure to somewhere further up the call chain, and so must reset the environment of the destination. If the caller saves state, then this is simple: the jump is a jump, and the destination knows where all the state has been saved. In most implementations, one need only reset the frame pointer to the current incarnation of the destination procedure, and take the jump to the label. But if the callee saves, then the caller has no idea how to recover his saved state, which may be buried any number of stack frames further down. It is therefore necessary to unwind the entire stack before taking the jump. The difference in cost can easily be a factor of 100 or more. I do not regard this as a marginal point. The exception is beginning to be used as a normal programming tool; it is a feature of several modern languages and will probably be a feature of most new ones. Its efficient implementation is as desirable as the efficient implementation of, say, for-loops or array assignments. Turning though to the main topic - which is the faster strategy for a normal call and return - I see two issues here: the number of registers to be saved and restored, and the cost of each save and restore. D. Some facts and guesses In my experience, there is almost no difference between the number of registers used by the caller and the number used by the callee. Small procedures tend to use fewer than less small, and leaf procedures tend to be a bit smaller, so on balance it seems marginally better for the callee to save. (What this also tells us is that interprocedural optimisation of leaves and leaf-callers only will give you big returns) But this is outweighed by two factors * The callee must save all registers it will use throughout the body; the caller need save only the registers that are live at the point of call. * When two or more calls occur in succession, both callees must save, but the caller need save only once. Rough guesses I have accumulated over time are * at any call point, caller is using ~ 2/3 of the registers it will use at all (though this is partly due to defects in register allocation strategies) * on average, a procedure call is (almost) immediately followed by another about 2/3 of the time. This implies that if the caller saves, it will have to save ~ 45 times for every 100 calls. These two factors together imply that the cost of a caller-saves protocol is about 1/3 that of a callee-saves protocol. (Do you believe that?) Now consider the cost of a save and restore. There are two factors that make it cheaper for the caller to save * the register may be slaving a known value. The caller then need not save at all, merely restore. I find this is true of at least 20% of live registers. (Consider for instance the MC68020. If you are working hard, you probably have 5 or 6 live D registers, of which at least one is holding a constant, and 4 or 5 live A registers, of which perhaps two are holding pointers to data structures. That's 3 out of 10) * the store may be combined with another operation. For example, the last operation on the register may have been an add ADDL2 X,R1 This can be changed into ADDL3 X,R1,save_place A small saving, admittedly, but a saving. There is a third factor I have not assessed satisfactorily, which applies especially to RISC machines. Is the caller or the callee better able to distribute loads and stores through the code, so as to overlap any load or store delays? I suspect it is the callee, but that is a hunch. E. What about Hardware Help? On the question of a "high-level procedure call" implemented in hardware, such as the VAX CALLS, I think my opinion is known: such instructions are worthless. But what about a simpler instruction, such as a hardware register save, or save-under-mask? A good example is provided by the MOVEM instruction of the MC68020, or the LDM/STM of the PE3200. These are typical, and typically their break-even point is at about 4 registers. This suggests to me that they are or marginal value - after all, if you are already passing parameters in registers, how many are left to save around the typical call? The instructions are probably still worth having for things like a task context save (if only the PE3200 permitted them to be used that way!), but their contribution to a procedure protocol is slight. F. Possibly Offensive Remark I agree with Henry, that procedure calling protocols need to be thought afresh for each fresh machine. Unfortunately, very few compiler shops seem prepared to do this. I see over and again the model of a closed, downward-growing stack; caller pushes parameters; callee saves registers and allocates local space. One can do better than this by a factor of two or three, on almost any machine I know, with a different model more amenable to local optimisation. We are used to making up in software for badly designed hardware. Now we have the reverse: making up in hardware for badly designed software. At least, that is my explanation for register windows.