Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!mit-eddie!genrad!decvax!ucbvax!amdcad!bcase From: bcase@amdcad.UUCP Newsgroups: comp.arch Subject: Re: subroutine frequency Message-ID: <14963@amdcad.UUCP> Date: Wed, 25-Feb-87 12:47:40 EST Article-I.D.: amdcad.14963 Posted: Wed Feb 25 12:47:40 1987 Date-Received: Fri, 27-Feb-87 21:12:04 EST References: <1881@homxc.UUCP> <898@moscom.UUCP> <476@mntgfx.MENTOR.COM> Reply-To: bcase@amdcad.UUCP (Brian Case) Organization: Advanced Micro Devices, Sunnyvale, California Lines: 58 In article <5746@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes: >In article <443@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes: >>Some RISC machines have multiple sets of registers, organized as overlapping >>windows. This approach to speeding up calls/returns was suggested to the >>RISC I team at Berkeley by Forest Baskett; simulations convinced us it would >>work for typical UNIX code, so we implemented it. It worked. >> Howard A. Landman > >As I understand it, each register window is a fixed size, and each window >is divided into three fixed size subwindows. (I don't remember what >these three subwindows are meant to be used for.) One particular aspect >of the implementation has been bothering me; namely, the amount of >overlap between two windows is a constant size. This has two consequences. > ... >So I'm wondering why the implementation of the RISC I and II don't >allow the programmer to specify the number of registers that should >be overlapped on the procedure call. Obviously, this would make >subroutine calls more expensive and difficult to implement. The >overlap size would have to be fetched from memory and added to the >window pointer. Checking to see if the register stack had overflowed >might be more expensive... However, these seem like trivial problems >when compared to the benefit of effectively having a larger >register stack. BTW, the Pyramid machine also have fixed-sized windows (16 in, 16 local, and 16 out, plus 16 globals). Well, the main reason that arbitrary window size was not implemented in the RISC I & II is that it requires an adder in the register address decode path. Using the technology available to them, this would have slowed the machine and used more area. Using the Berkeley scheme, the "add" function is integrated right into the register file address decoders. Arbitrary window size is not that difficult to implement, assuming that you can pay the cost of the adder. The window size does not need to be separately fetched from memory: it can be an immediate field in an instruction. In the implementation that I am thinking of, based on my MS thesis work, the window is allocated at procedure entry time and deallocated at procedure exit time, just as in the RISC I & II. Thus, the window which is allocated must be made large enough to contain the outgoing parameters for any subsequent call. Also, the bounds checking operation to detect overflow and underflow need only be done once at call time and once at exit time. See Ditzel's paper on the C Machine Stack Cache for a more general implementation (but expensive!!). Yes, arbitrary window size does make calls and returns more expensive, but not by much (under the assumptions I have made). The benefits of better register utilization and accomodation of any argument list outweigh the small cost. The call and return are still an order of magnatude faster (in terms of cycles) than the "calls"-style exemplified by the vax. Again, a very interesting alternative to a stack cache is register allocation at link time. Wall from Dec Western Research has done/is doing the first work I have seen in this area. bcase ---------- Keep watching this space just a little longer.