Path: utzoo!attcan!uunet!littlei!omepd!mcg
From: mcg@omepd (Steven McGeady)
Newsgroups: comp.arch
Subject: Re: Procedure Call Protocol
Message-ID: <3926@omepd>
Date: 16 Nov 88 05:26:22 GMT
References: <3300037@m.cs.uiuc.edu> <5938@killer.DALLAS.TX.US> <7580@aw.sei.cmu.edu>
Organization: Intel Corp., Hillsboro
Lines: 169

In article <7583@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:

>	An Example Procedure Protocol
>	-----------------------------
>
>Readers of this newsgroup know that I'm rarely reticent about
>criticising other people's hardware and software.  Here is your
>chance to do me the same service.

...
[example of proposed MIPS R2000 linkage code]
...

>The cost of a basic procedure call, entry and exit is therefore 12 cycles,
>ignoring any parameters or result.  Not bad, when you reflect that the
>Intel 80960 takes 15 cycles to do a good deal less (an unfair comparison,
>of course, since that machine is handicapped by having special call and
>return instructions).

In general I ignore Mr. Firth's gratuitous slams on the 960 (why does
he have it out for just our processor?  Did Intel decline to offer him
a job at some point in the past?) ... but I feel a need to set the
record straight.  The 960 does a good deal *more* with its call and
return instructions that Mr. Firth has noted, not a good deal less.
Further, the 960 architecture (as opposed to a single implementation)
is benefited by a call and return instructions, as they (as I will show)
allow the processor to scale its performance to different implementation
techniques.

I'll start by saying that Mr. Firth is being generous.  The current
implementation of the 960 has a call instruction that takes 9 cycles,
and a return that takes 7, for a total of 16, rather than 15 cycles.
His generosity stops at this point.  During 16 cycles of a call/ret,
the following operations take place:

	1) the stack is adjusted for saving procedure-local registers,
	   the frame pointer is adjusted, and the previous value of
	   the frame-pointer is saved
	2) procedure-local registers are saved in the local register cache
	3) the return intruction pointer is loaded with the address of
	   the next instruction in the caller

"Well!" I hear you cry.  "What if I don't want to do all of that?"
There is an easy answer.  Mr. Firth fails to mention the 'bal' (branch
and link) instruction in the 960, the exact analog of the MIPS 'jal'
instruction, with the expception that the 960 does not have a delay slot.
The bal instruction execute in 2 cycles if the target is in the instruction
cache (which, incidentally, is the same restriction that MIPS has, though
Mr. Firth does not mention it).  Since you also have 32 registers on the
960, one can construct the same calling sequence on the 960 that Mr. Firth
constructs on the R2000.  This is wise for leaf procedures, and we do it.
It is generally unwise for non-leaf procedures:

Mr. Firth wisely neglects to count in his 12 cycle cost the cost of saving
procedure-local registers to the stack.  Typical C programs use between 4 and
12 procedure-local variables (variables local to the procedure which are not
referenced by enclosed scopes and thus can be allocated permanantly to
registers).  This adds between 4 and 12 additional memory references at
call and return in either the calling site or the called site (depending on
whether you are using caller or callee save or some combination thereof).
If you (charitably) consider the average to be 6, and (charitably) assume
that Mr. Firth is running from zero-wait-state cache and that he gets no cache
misses, and (charitably) assume that he can fill the delay slots on three
of these stores, you are left with an additional 18 clocks, making Mr. Firth's
"12 cycle" (unoptimized) call/return into a 30 cycle call/return.

Now, lest I be unfair, I will point out that the 80960 has a penalty when the
register cache spills to memory (24 clocks).  However, since procedures that
oscillate within a relatively small call depth *never* spill to memory, the
effect is taken in only a fraction of all routines.  Patterson has done some
research to this effect, and our research shows that 4 sets of 16 registers
(non-overlapped) as currently implemented in the 80960K*, handle about
80-85% of calls and returns (8 sets would handle about 98%).  So let's add
8 cycles to our average 80960 call/return time, making it 24 cycles - really
not all that much different from MIPS.

Now let us set about optimizing each of these.  Dave Wall (see SigPlan'88
Compiler Construction Conference Proceedings) would argue that link-time
allocation of a large (~ 58) global register set would minimize the register
spill overhead of a MIPS-style machine (though the differences between 32
and 58 registers are not clear).  However, MIPS (see also SigPlan'88) does not
use this technique, instead "shrink-wrapping" register spills and pushing them
up (and down) the call tree.  I don't have the papers handy so I can't
reproduce actual data.  Nevertheless, with these techniques and those
Mr. Firth outlines, you can further reduce this time, *if you have an
extremely sophisticated compiler*.

The 80960 gets comparable performance without this sophistication.
The 80960 linker employs a very simple technique of promoting calls to
leaf procedures from 'call' instructions, which the compiler outputs
to 'bal' instructions - the called site declares that it is a leaf
procedure (with the '.leafproc' pseudo-op) and provides both leaf and
non-leaf entry prologues.  The linker (at link time and across modules)
replaces call instructions with bal instructions to the same address.
If strange code somewhere has taken the address of the procedure and
calls it indirectly, it enters though the standard prologue.

I charge that, in general, and factoring out compiler and tool differences,
the difference between the MIPS calling sequence and the current implementation
of the 960 calling sequence is minimal.  This is supported by David Wall's
paper (though I should point out that Wall does not like register-window
schemes, though I don't think he has looked at the 960's non-overlapping
register cache scheme).  A globally-optimizing compiler for the 960 would
instead concentrate on procedure inlining and more advanced call-tree
depth-reduction techniques.

So, if MIPS did just as well with no call/return, why did we implement
it?  There are a number of reasons not touched above:

	1) code density - the 960 call/return instructions typically
	   occupy one word each (two for call if there is more than a
	   21-bit displacement involved);  as Mr. Firth admirably
	   showed, many architectures use 6-12 words for this.
	   In call-intensive programs this adds up quickly (960
	   code density is about 15-25% worse than a VAX, MIPS code
	   density is reported to be 40-50% worse for comparable
	   compiler technology).

	2) increased parallelism -  stack-cache spills in the 960
	   can be overlapped with subsequent (target) instruction
	   execution.  Furthermore, the 960 call/return instructions
	   in the current implementation are implemented with a
	   minimum of hardware to keep chip cost low.  These
	   instructions (like all 960 "microcode" flows) are really
	   on-chip macros of (mostly) normal instructions.  Future
	   implementations, however, will be able to exploit additional
	   hardware to dramatically increase the speed of these operations.
	   Ultimately, we expect call/ret to be a 2 clock operation
	   (rendering the bal optimization obsolete).  The MIPS
	   linkage sequence will always be frozen as N separate
	   instructions, hence N clocks (pick your favourite N in
	   the range 6-30).

	3) provision for a range of implementations - it may seem
	   silly to some people to build less than the fastest possible
	   chip, but to those people I suggest that they attempt to
	   build a MIPS system on a 64 sq.in. board with 100ns DRAMS
	   and see how fast it goes.  The 960 is architected so that
	   we can provide processors with simple, low-cost busses
	   that work with slow or moderate-speed memory systems.
	   The speed of the register-cache spill is directly a function
	   of the bus architecture.

I have the greatest respect for the MIPS architects and designers.
They built an architecture using the technology and under the limitations
they had at hand, which included compiler technology (very good), 
silicon technology (foundry-based and uncertain), and silicon budget
(limited).  They produced a chip which is admirably well-honed for
the workstation market to which it is targeted (except, of course,
it doesn't run MS-DOS :-) ).  The MIPS implementations will tend to
scale directly with increases in process technology and clock speed
(I believe that this is MIPS' contention), until they have to
put microwave waveguides on the package (apologies to Nick Treddenick).

In general, the 960 K-series implementation is a practical and inexpensive
one for small systems without large caches and wide busses, and not the peak
implementation of architecture.  (Read Glen Myers' book
*The 80960 Microprocessor Architecture*, Wiley, 1988, ISBN 0 471-61857-8,
for more details about some of the trade-offs).  The 960 architecture will
scale better than linearly, because the architecture allows more
opportunities for fine-grained parallelism.

I hope that the next time Mr. Firth sallys forth on such an exposition,
he either presents a more balanced view, or confines himself to commentary
on a processor with which he is familiar.  If he has additional questions
concerning the 960, I would be glad to answer them.

S. McGeady
Intel Corp.