Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!mit-eddie!ll-xn!ames!ucbcad!ucbvax!hplabs!pyramid!prls!mips!hansen
From: hansen@mips.UUCP (Craig Hansen)
Newsgroups: comp.arch
Subject: Re: register windows
Message-ID: <821@mips.UUCP>
Date: Thu, 22-Oct-87 15:34:09 EST
Article-I.D.: mips.821
Posted: Thu Oct 22 15:34:09 1987
Date-Received: Sun, 25-Oct-87 10:09:39 EST
References: <201@PT.CS.CMU.EDU> <933@cpocd2.UUCP>
Lines: 77
Keywords: register windows, interrupt latency
Summary: register windows << fast loads & stores

In article <933@cpocd2.UUCP>, howard@cpocd2.UUCP (Howard A. Landman) writes:
> In article <201@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
> >There are those who argue that interrupts and task switches are enormously
> >rare compared to subroutine calls and returns. They are right. They then
> >argue that the rare events can be penalized, if this makes the common events
> >run faster. (The RISC argument, if you will.)
> 
> The argument is simply that this approach leads to the best average
> performance.  Which it does.

Sigh. As always, RISC means many different things to different people.  "The
RISC argument" leads to varying conclusions, depending on what you're
focusing on, and what your design team can and can't do easily. I've worked
on three RISC architectures in different environments, and they each ended
up quite unique. The MIPS team has the strongest software and measurement
skills of the three design teams, and that let us delegate a lot more to
software, and quickly see where design trade-offs were taking us.

Unfortunately, comparing interrupt and task switch rates to subroutine call
and return rates is only part of the story. The register window approach
trades hardware resources against optimizing software. The MIPS-R2000, which
doesn't have hardware register windows, relies on an intelligent use of a
flat, uniform, register set, implemented by optimizing compilers.  As has
been described in this forum previously, procedure arguments and return
values are passed in registers. In addition, the compiler system identifies
leaf-level routines, and allocates variables into registers, starting at
these leaf-level routines, working its way up the call tree. When registers
must be spilled to memory, the optimizer selects appropriate locations for
the register saving and restoring code (moving it up the call tree, outside
loops, etc.). The end result is a healthy reduction in the amount of
register saving and restoring at procedure boundaries. The register window
approach opts for simpler software (a good idea if you don't have software),
but spends more die area and, ultimately, cycle time on register file
accesses and register file spilling hardware.
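
To make the leaf-first allocation concrete, here is a small C sketch. The
function names are made up for illustration, and what any particular
compiler does with them is an assumption rather than something measured
here.

```c
#include <assert.h>

/* Leaf routine: it calls nothing, so an optimizer is free to keep a, b
 * and the partial sums in scratch registers with no save/restore code
 * at entry or exit. (Hypothetical example.) */
int dot3(const int *a, const int *b)
{
    assert(a != 0 && b != 0);
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Non-leaf routine: v, w and the first result must survive the calls.
 * An allocator that worked up from the leaves already knows which
 * registers dot3() touches, so it can try to keep these values in
 * registers dot3() leaves alone instead of spilling around each call. */
int norm2_sum(const int *v, const int *w)
{
    return dot3(v, v) + dot3(w, w);
}
```

With v = {1, 2, 3} and w = {4, 5, 6}, dot3(v, w) is 32 and norm2_sum(v, w)
is 91; the point is only where the values can live, not the arithmetic.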

Ultimately, the selection of an architecture is the result of several
inter-related design trade-offs. While it is not an absolute requirement of
register windows, let me observe that every register-window machine built to
date takes multiple cycles to perform canonical load and store operations.
There's no question that register windows can, in some cases, reduce load
and store frequencies, but to really pay their cost, they have to reduce
load and store frequencies sufficiently to offset the higher cost of load
and store operations on these machines. I've not been on the design team of
a register windowed machine, but it seems that the H/W designers might have
assumed that the register windows were going to eliminate so many memory
references that they didn't spend the time they should have on making loads
and stores go fast.
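
The break-even point can be sketched with a little arithmetic. The C
fragment below is only an illustration: the function names are mine, and
the counts used with it are invented round numbers, not measurements of
any machine.

```c
#include <assert.h>

/* Cycles spent on loads and stores for some fixed piece of work.
 * loads_stores:  how many loads and stores that work executes
 * cycles_per_ls: how many cycles each one takes */
long mem_cycles(long loads_stores, long cycles_per_ls)
{
    assert(loads_stores >= 0 && cycles_per_ls >= 1);
    return loads_stores * cycles_per_ls;
}

/* Largest load/store count at which a machine taking slow_cycles per
 * access spends no more memory cycles than a machine that executes
 * fast_ls accesses at fast_cycles each. */
long break_even_ls(long fast_ls, long fast_cycles, long slow_cycles)
{
    assert(slow_cycles >= 1);
    return mem_cycles(fast_ls, fast_cycles) / slow_cycles;
}
```

Suppose, purely for illustration, that a flat-register machine executes 300
loads and stores at one cycle each. break_even_ls(300, 1, 2) is 150 and
break_even_ls(300, 1, 3) is 100, so a machine with two-cycle accesses has to
eliminate half of those references just to draw even, and one with
three-cycle accesses has to eliminate two thirds.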

If you are concerned about UNIX kernel performance, it is worth noting
that the MIPS-R2000 instruction set and compiler system are designed
to permit efficient references to data contained in structures. A reference
to a structure element through a pointer is a single-cycle operation on the
R2000, but takes two to three cycles at minimum on the SPARC and Am29k
register-window machines. Now what's the relative frequency of structure
references in a UNIX kernel? (Hint: Dhrystone will give you the wrong answer.
I'm not trying to be glib here; Dhrystone is a fine benchmark for comparing
some things [uncached machines without optimizing compilers, x86 compilers
for Personal Computers], but its data reference characteristics,
particularly the use of "addressing modes" and data dependencies, aren't
representative of anything.)
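
For readers who want to see what such a reference looks like, here is a
hedged C sketch. The structure and its field names are hypothetical, and
the instruction named in the comment is only what one would expect a
compiler to emit for a load with a base register plus displacement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel-style structure; the field names are made up. */
struct proc_entry {
    int pid;
    int priority;
    struct proc_entry *next;
};

/* p->priority is one load from (p + constant offset). With a load
 * instruction that takes a base register plus an immediate
 * displacement, this can be a single instruction, roughly
 *     lw  result, OFFSET(p)
 * where OFFSET is offsetof(struct proc_entry, priority). */
int get_priority(const struct proc_entry *p)
{
    assert(p != NULL);
    return p->priority;
}

/* The same access with the address arithmetic spelled out, as it must
 * be done when loads accept only a ready-made address: add, then load. */
int get_priority_explicit(const struct proc_entry *p)
{
    const char *addr;
    assert(p != NULL);
    addr = (const char *)p + offsetof(struct proc_entry, priority);
    return *(const int *)addr;
}
```

Both functions return the same value; the second just makes visible the
extra address computation that the first can fold into the load.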

If you've got some dusty FORTRAN card decks that you've been using on some
itty-bitty-mainframe computer, you should note that procedure calls aren't
all that frequent, compared to Dhrystone. Having an optimizer that can put a
couple of dozen variables (not necessarily local variables) into registers
efficiently, and make fast references to global variables (FORTRAN is
notorious for its use of variables stuffed into common blocks) is a big win
here. The MIPS compiler system sets aside one register to point to a region
of the global data space, so that a single-cycle instruction can get to the
value of any variable placed there. Again, here, the speed of basic load and
store operations will matter more than whether the registers are windowed.
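
The global-pointer idea can be mimicked in portable C. This is a model
only: the array stands in for the global data region, the gp variable
stands in for the reserved register, and the offsets are invented rather
than anything a real linker assigned.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the small global data region. */
static unsigned char small_data[64];
/* Stand-in for the reserved register that points at that region. */
static unsigned char *const gp = small_data;

/* Hypothetical offsets for two common-block-style variables. */
enum { OFF_COUNT = 0, OFF_LIMIT = (int)sizeof(int) };

/* One base-plus-offset access: on the real machine this would be a
 * single load such as  lw reg, offset(gp). */
int load_global(int offset)
{
    int v;
    assert(offset >= 0 && offset + (int)sizeof v <= (int)sizeof small_data);
    memcpy(&v, gp + offset, sizeof v);
    return v;
}

/* The matching single store through the same base pointer. */
void store_global(int offset, int v)
{
    assert(offset >= 0 && offset + (int)sizeof v <= (int)sizeof small_data);
    memcpy(gp + offset, &v, sizeof v);
}
```

Every variable in the region is reached the same way, from one base that
never changes, so no address has to be built up before each access.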

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen