Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!think.com!snorkelwacker.mit.edu!bloom-beacon!eru!hagbard!sunic!dkuug!freja.diku.dk!gere.diku.dk!torbenm
From: torbenm@gere.diku.dk (Torben [gidius Mogensen)
Newsgroups: comp.arch
Subject: Re: registerless architecture
Message-ID: <1990Nov14.113748.3677@diku.dk>
Date: 14 Nov 90 11:37:48 GMT
References: <1990Nov12.145410.29035@cs.cmu.edu>
Sender: news@diku.dk (The Netnews System)
Organization: Department Of Computer Science, University Of Copenhagen
Lines: 83

All this discussion about registers versus no registers but cache has
centered about the costs of implementing local variables: either by
keeping them in registers and then loading/saving them across calls,
or in memory and make the cache take care of loading/saving to main
memory.

There are several issues involved in this:

1) What are the relative costs of accessing registers and cache
memory.

2) What are the relative cost of loading/saving a (large) set of
registers in a burst, versus letting the cache do so at its own pace.

3) How do you effectively map a fixed number of registers to a
variable number of local variables. (Register allocation).


As for 1), there is a general acceptance that registers are faster
than cache, but only slightly so. The main problem is that, with
present cache architectures, you can access only one cache element per
cycle, whereas it is possible to access several registers. This can be
solved by implementing a multiported cache. Also, it might be possible
to access on-chip cache as fast as registers (to within a few percent)
if the architecture is designed for this.


As for 2), there has been arguments pointing both ways. Sequential
mode access to memory is faster than random access, so this speaks in
favor of loading/saving in bursts. It must be noted that many modern
cache architectures do the same: If there is a cache miss, a block of
cells is saved/loaded in one go.

The main difference is that with registers, loading/saving is done in
a consistent (compile-time) fashion, whereas with cache it is done by
need (run-time). The compile-time register saving will often lead to
unnecessary memory traffic, as registers might be saved, only to be
loaded again immediately afterwards, because the procedure returns
immediately (typically the leaf calls in a recursive algorithm). The
same may happen, to a lesser degree, if the cache use burst mode
access to main memory.

On the other hand, local variables need not be saved when leaving a
procedure, and the register saving strategy will know that.  The local
variables will still be kept in the cache, so they will be saved
unnecessarily when the cache locations are re-used.  Note that this
does not happen all the time: if a return is followed by another call,
the same memory locations are re-used, so no cache miss occur.

It is possible to make a stack-cache that will know that values above
the stack top are garbage, and set the do-not-save bit. This begins to
look very much like variable sized register windows, the difference
being that you can address arbitrarily far down the stack in a
transparent fashion.


As for 3), register based architectures require register allocation to
map local variables to a possibly smaller number of registers. While
good algorithms exist, they are all compile-time based and must thus
take worst-case behaviour into account. This will invariably lead to
saving registers when this is (in a particular run-time situation) not
necessary. While the problem in most cases is small, it can in some
cases have a noticeable effect. This is especially true in languages
like LISP or Prolog, where basic blocks and procedures are small.


The above discussion has centered around loading and saving of local
variables, but there are other points to consider. In languages that
use a lot of heap access (LISP, Prolog,...), a multiported cache is a
huge benefit, whereas a large set of registers almost no help at all.
Even if the cache uses burst mode access, it can benefit a heap. In
fact the article "Cache Performance of Combinator Graph Reduction" by
Koopman, Lee and Siewiorek in this years IEEE International Conference
on Computer Languages argues that a large cache block size is
beneficial for such languages.


All in all, IMHO a well-designed registerless architecture with a
multiported cache can perform just as well as a register architecture
on C-like languages and far better on languages like LISP and Prolog.


	Torben Mogensen (torbenm@diku.dk)