Path: utzoo!attcan!uunet!littlei!omepd!mcg From: mcg@omepd (Steven McGeady) Newsgroups: comp.arch Subject: Re: Procedure Call Protocol Message-ID: <3926@omepd> Date: 16 Nov 88 05:26:22 GMT References: <3300037@m.cs.uiuc.edu> <5938@killer.DALLAS.TX.US> <7580@aw.sei.cmu.edu> Organization: Intel Corp., Hillsboro Lines: 169 In article <7583@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes: > An Example Procedure Protocol > ----------------------------- > >Readers of this newsgroup know that I'm rarely reticent about >criticising other people's hardware and software. Here is your >chance to do me the same service. ... [example of proposed MIPS R2000 linkage code] ... >The cost of a basic procedure call, entry and exit is therefore 12 cycles, >ignoring any parameters or result. Not bad, when you reflect that the >Intel 80960 takes 15 cycles to do a good deal less (an unfair comparison, >of course, since that machine is handicapped by having special call and >return instructions). In general I ignore Mr. Firth's gratuitous slams on the 960 (why does he have it out for just our processor? Did Intel decline to offer him a job at some point in the past?) ... but I feel a need to set the record straight. The 960 does a good deal *more* with its call and return instructions that Mr. Firth has noted, not a good deal less. Further, the 960 architecture (as opposed to a single implementation) is benefited by a call and return instructions, as they (as I will show) allow the processor to scale its performance to different implementation techniques. I'll start by saying that Mr. Firth is being generous. The current implementation of the 960 has a call instruction that takes 9 cycles, and a return that takes 7, for a total of 16, rather than 15 cycles. His generosity stops at this point. During 16 cycles of a call/ret, the following operations take place: 1) the stack is adjusted for saving procedure-local registers, the frame pointer is adjusted, and the previous value of the frame-pointer is saved 2) procedure-local registers are saved in the local register cache 3) the return intruction pointer is loaded with the address of the next instruction in the caller "Well!" I hear you cry. "What if I don't want to do all of that?" There is an easy answer. Mr. Firth fails to mention the 'bal' (branch and link) instruction in the 960, the exact analog of the MIPS 'jal' instruction, with the expception that the 960 does not have a delay slot. The bal instruction execute in 2 cycles if the target is in the instruction cache (which, incidentally, is the same restriction that MIPS has, though Mr. Firth does not mention it). Since you also have 32 registers on the 960, one can construct the same calling sequence on the 960 that Mr. Firth constructs on the R2000. This is wise for leaf procedures, and we do it. It is generally unwise for non-leaf procedures: Mr. Firth wisely neglects to count in his 12 cycle cost the cost of saving procedure-local registers to the stack. Typical C programs use between 4 and 12 procedure-local variables (variables local to the procedure which are not referenced by enclosed scopes and thus can be allocated permanantly to registers). This adds between 4 and 12 additional memory references at call and return in either the calling site or the called site (depending on whether you are using caller or callee save or some combination thereof). If you (charitably) consider the average to be 6, and (charitably) assume that Mr. Firth is running from zero-wait-state cache and that he gets no cache misses, and (charitably) assume that he can fill the delay slots on three of these stores, you are left with an additional 18 clocks, making Mr. Firth's "12 cycle" (unoptimized) call/return into a 30 cycle call/return. Now, lest I be unfair, I will point out that the 80960 has a penalty when the register cache spills to memory (24 clocks). However, since procedures that oscillate within a relatively small call depth *never* spill to memory, the effect is taken in only a fraction of all routines. Patterson has done some research to this effect, and our research shows that 4 sets of 16 registers (non-overlapped) as currently implemented in the 80960K*, handle about 80-85% of calls and returns (8 sets would handle about 98%). So let's add 8 cycles to our average 80960 call/return time, making it 24 cycles - really not all that much different from MIPS. Now let us set about optimizing each of these. Dave Wall (see SigPlan'88 Compiler Construction Conference Proceedings) would argue that link-time allocation of a large (~ 58) global register set would minimize the register spill overhead of a MIPS-style machine (though the differences between 32 and 58 registers are not clear). However, MIPS (see also SigPlan'88) does not use this technique, instead "shrink-wrapping" register spills and pushing them up (and down) the call tree. I don't have the papers handy so I can't reproduce actual data. Nevertheless, with these techniques and those Mr. Firth outlines, you can further reduce this time, *if you have an extremely sophisticated compiler*. The 80960 gets comparable performance without this sophistication. The 80960 linker employs a very simple technique of promoting calls to leaf procedures from 'call' instructions, which the compiler outputs to 'bal' instructions - the called site declares that it is a leaf procedure (with the '.leafproc' pseudo-op) and provides both leaf and non-leaf entry prologues. The linker (at link time and across modules) replaces call instructions with bal instructions to the same address. If strange code somewhere has taken the address of the procedure and calls it indirectly, it enters though the standard prologue. I charge that, in general, and factoring out compiler and tool differences, the difference between the MIPS calling sequence and the current implementation of the 960 calling sequence is minimal. This is supported by David Wall's paper (though I should point out that Wall does not like register-window schemes, though I don't think he has looked at the 960's non-overlapping register cache scheme). A globally-optimizing compiler for the 960 would instead concentrate on procedure inlining and more advanced call-tree depth-reduction techniques. So, if MIPS did just as well with no call/return, why did we implement it? There are a number of reasons not touched above: 1) code density - the 960 call/return instructions typically occupy one word each (two for call if there is more than a 21-bit displacement involved); as Mr. Firth admirably showed, many architectures use 6-12 words for this. In call-intensive programs this adds up quickly (960 code density is about 15-25% worse than a VAX, MIPS code density is reported to be 40-50% worse for comparable compiler technology). 2) increased parallelism - stack-cache spills in the 960 can be overlapped with subsequent (target) instruction execution. Furthermore, the 960 call/return instructions in the current implementation are implemented with a minimum of hardware to keep chip cost low. These instructions (like all 960 "microcode" flows) are really on-chip macros of (mostly) normal instructions. Future implementations, however, will be able to exploit additional hardware to dramatically increase the speed of these operations. Ultimately, we expect call/ret to be a 2 clock operation (rendering the bal optimization obsolete). The MIPS linkage sequence will always be frozen as N separate instructions, hence N clocks (pick your favourite N in the range 6-30). 3) provision for a range of implementations - it may seem silly to some people to build less than the fastest possible chip, but to those people I suggest that they attempt to build a MIPS system on a 64 sq.in. board with 100ns DRAMS and see how fast it goes. The 960 is architected so that we can provide processors with simple, low-cost busses that work with slow or moderate-speed memory systems. The speed of the register-cache spill is directly a function of the bus architecture. I have the greatest respect for the MIPS architects and designers. They built an architecture using the technology and under the limitations they had at hand, which included compiler technology (very good), silicon technology (foundry-based and uncertain), and silicon budget (limited). They produced a chip which is admirably well-honed for the workstation market to which it is targeted (except, of course, it doesn't run MS-DOS :-) ). The MIPS implementations will tend to scale directly with increases in process technology and clock speed (I believe that this is MIPS' contention), until they have to put microwave waveguides on the package (apologies to Nick Treddenick). In general, the 960 K-series implementation is a practical and inexpensive one for small systems without large caches and wide busses, and not the peak implementation of architecture. (Read Glen Myers' book *The 80960 Microprocessor Architecture*, Wiley, 1988, ISBN 0 471-61857-8, for more details about some of the trade-offs). The 960 architecture will scale better than linearly, because the architecture allows more opportunities for fine-grained parallelism. I hope that the next time Mr. Firth sallys forth on such an exposition, he either presents a more balanced view, or confines himself to commentary on a processor with which he is familiar. If he has additional questions concerning the 960, I would be glad to answer them. S. McGeady Intel Corp.