Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!samsung!think!mintaka!mit-eddie!uw-beaver!uw-june!pardo
From: pardo@cs.washington.edu (David Keppel)
Newsgroups: comp.sys.m88k
Subject: Re: Register Allocation (was Re: Info about 88open & standards)
Summary: long
Message-ID: <10007@june.cs.washington.edu>
Date: 30 Nov 89 18:25:18 GMT
References: <1989Nov16.212149.9770@paris.ics.uci.edu> <100050002@hpcuhc.HP.COM> <THOMAS.89Nov28100512@uplog.uplog.se> <9972@june.cs.washington.edu> <7263@sybase.sybase.com>
Reply-To: pardo@june.cs.washington.edu (David Keppel)
Distribution: na
Organization: University of Washington, Computer Science, Seattle
Lines: 113

> pardo@june.cs.washington.edu (David Keppel) writes:
>>On the RISC-family processors, you'll probably
>>lose a lot.  Is it worth adding save-by-mask hardware?  Probaly not.

In article <7263@sybase.sybase.com> tim@binky.UUCP (Tim Wood) writes:
>Please subtantiate these assertions.

A couple other people have already; allow me to summarize.

A RISC has [is defined as having] a load/store architecture, one
instruction issued per clock cycle.  Save-by-mask in hardware is
non-RISC.  Implementing save-by-mask in software requires several
instructions to be executed for each (potential) save.

Between the blowup in code size and the extra instructions, you're
probably better off blindly saving n registers.  Remember that
conditionally saved registers must also be conditionally restored at
function return.

A (long) example

Suppose callee saves 8, caller saves 8.  The compiler will leave
caller saves registers empty (don't need saving) aroudn function
calls.  The caller could thus needlessly save up to 8 registers.
If the callee needs registers it will try to use registers that were
saved by the callee.  If that fails it will save up to 8 registers on
its own, but those registers might be dead in the caller, so up to 8
registers could be saved needlessly.  The wasted caller-saves and
wasted callee-saves can never overlap.  At most 8 registers are saved
needlessly.

If load/store is pipelined, then the worst-case code looks like, say

    add sp 32 sp
    store r1 0[sp]
    store r2 4[sp]
    store r3 8[sp]
    store r4 12[sp]
    store r5 16[sp]
    store r6 20[sp]
    store r7 24[sp]
    store r8 28[sp]
    call
    nop
    load 28[sp] r8
    ...
    subl sp 32 sp

That's 19 clock ticks wasted.  The symmetric case for callee saves
similarly wastes 19 clock ticks.

Now consider save-by-mask
  
    [r1 to r16 can are saved by mask.]
    [mask passed in rN, shadow is rM.]
    [Destination is rightmost operand.]
    [No branch delay slots.]
  
  function:
    jmp !mask done
  r1:
    and mask 1 shadow
    jmp !shadow r2
    save r1
  r2:
    and mask 2 shadow
    jmp !shadow r2
    save r2
    jmp !mask done
  r3:
  ...
  
  r16:
    and mask 2*^15 shadow
    jmp !shadow done
    save r16
  done:

(Note: you may well be able to do better than this.  This is an
example of one sequence for one hypothetical machine.)

The worst case is 16*3 = 48 instructions, but if jumps introduce more
than one bubble, then it could blow up to about 16*4 = 64
instructions.  Let's go with 48.  On average you should have to save
about half the registers (a guess).  A clever optimzier will put
registers in low-number registers, so that you don't generally have to
go all the way to r16 in order to save half the registers.  So on
average you'll hit about 24 instructions in the prolog and about 24
instructions in the epilog, or about 48 instructions.

Of course the register saves are highly machine- and
application-dependent.  Suppose that generally you have to save only 2
registers and that they are r1 and r3.  Then you'll execute 9
instructions on entry and 9 instructions on exit and you're only doing
1 save better than the *worst* case of the naive caller-saves-some and
callee-saves-some!

Of course this all needs to be parameterized on a machine-by-machine
basis, and also on an application-by-application basis.

* Average number of registers to save.
* Average number of saves that are wasted.
* Cost to blindly do split caller/callee saves
* Cost to do save-by-mask.

If you really care, you should start by reading the stuff availabile
via anonymous ftp from june.cs.washington.edu  (128.95.1.4) in
tmp/pardo/proc.call.txt

	;-D on  ( Lies, damn lies, and USENET )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo