Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!mit-eddie!genrad!decvax!ucbvax!amdcad!bcase
From: bcase@amdcad.UUCP
Newsgroups: comp.arch
Subject: Re: subroutine frequency
Message-ID: <14963@amdcad.UUCP>
Date: Wed, 25-Feb-87 12:47:40 EST
Article-I.D.: amdcad.14963
Posted: Wed Feb 25 12:47:40 1987
Date-Received: Fri, 27-Feb-87 21:12:04 EST
References: <1881@homxc.UUCP> <898@moscom.UUCP> <476@mntgfx.MENTOR.COM>
Reply-To: bcase@amdcad.UUCP (Brian Case)
Organization: Advanced Micro Devices, Sunnyvale, California
Lines: 58

In article <5746@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>In article <443@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>>Some RISC machines have multiple sets of registers, organized as overlapping
>>windows.  This approach to speeding up calls/returns was suggested to the
>>RISC I team at Berkeley by Forest Baskett; simulations convinced us it would
>>work for typical UNIX code, so we implemented it.  It worked.  
>>	Howard A. Landman
>
>As I understand it, each register window is a fixed size, and each window
>is divided into three fixed size subwindows.  (I don't remember what
>these three subwindows are meant to be used for.)  One particular aspect
>of the implementation has been bothering me; namely, the amount of
>overlap between two windows is a constant size.  This has two consequences.
> ...
>So I'm wondering why the implementation of the RISC I and II don't
>allow the programmer to specify the number of registers that should
>be overlapped on the procedure call.  Obviously, this would make
>subroutine calls more expensive and difficult to implement.  The
>overlap size would have to be fetched from memory and added to the
>window pointer.  Checking to see if the register stack had overflowed
>might be more expensive...  However, these seem like trivial problems
>when compared to the benefit of effectively having a larger
>register stack.

BTW, the Pyramid machine also have fixed-sized windows (16 in, 16 local,
and 16 out, plus 16 globals).

Well, the main reason that arbitrary window size was not implemented in the
RISC I & II is that it requires an adder in the register address decode
path.  Using the technology available to them, this would have slowed the
machine and used more area.  Using the Berkeley scheme, the "add" function
is integrated right into the register file address decoders.

Arbitrary window size is not that difficult to implement, assuming that
you can pay the cost of the adder.  The window size does not need to be
separately fetched from memory:  it can be an immediate field in an
instruction.  In the implementation that I am thinking of, based on my
MS thesis work, the window is allocated at procedure entry time and 
deallocated at procedure exit time, just as in the RISC I & II.  Thus,
the window which is allocated must be made large enough to contain the
outgoing parameters for any subsequent call.  Also, the bounds checking
operation to detect overflow and underflow need only be done once at
call time and once at exit time.  See Ditzel's paper on the C Machine
Stack Cache for a more general implementation (but expensive!!).

Yes, arbitrary window size does make calls and returns more expensive, but
not by much (under the assumptions I have made).  The benefits of better
register utilization and accomodation of any argument list outweigh the
small cost.  The call and return are still an order of magnatude faster
(in terms of cycles) than the "calls"-style exemplified by the vax.

Again, a very interesting alternative to a stack cache is register allocation
at link time.  Wall from Dec Western Research has done/is doing the first
work I have seen in this area.

    bcase
----------
Keep watching this space just a little longer.