Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!think.com!snorkelwacker.mit.edu!bloom-beacon!eru!hagbard!sunic!dkuug!freja.diku.dk!gere.diku.dk!torbenm From: torbenm@gere.diku.dk (Torben [gidius Mogensen) Newsgroups: comp.arch Subject: Re: registerless architecture Message-ID: <1990Nov14.113748.3677@diku.dk> Date: 14 Nov 90 11:37:48 GMT References: <1990Nov12.145410.29035@cs.cmu.edu> Sender: news@diku.dk (The Netnews System) Organization: Department Of Computer Science, University Of Copenhagen Lines: 83 All this discussion about registers versus no registers but cache has centered about the costs of implementing local variables: either by keeping them in registers and then loading/saving them across calls, or in memory and make the cache take care of loading/saving to main memory. There are several issues involved in this: 1) What are the relative costs of accessing registers and cache memory. 2) What are the relative cost of loading/saving a (large) set of registers in a burst, versus letting the cache do so at its own pace. 3) How do you effectively map a fixed number of registers to a variable number of local variables. (Register allocation). As for 1), there is a general acceptance that registers are faster than cache, but only slightly so. The main problem is that, with present cache architectures, you can access only one cache element per cycle, whereas it is possible to access several registers. This can be solved by implementing a multiported cache. Also, it might be possible to access on-chip cache as fast as registers (to within a few percent) if the architecture is designed for this. As for 2), there has been arguments pointing both ways. Sequential mode access to memory is faster than random access, so this speaks in favor of loading/saving in bursts. It must be noted that many modern cache architectures do the same: If there is a cache miss, a block of cells is saved/loaded in one go. The main difference is that with registers, loading/saving is done in a consistent (compile-time) fashion, whereas with cache it is done by need (run-time). The compile-time register saving will often lead to unnecessary memory traffic, as registers might be saved, only to be loaded again immediately afterwards, because the procedure returns immediately (typically the leaf calls in a recursive algorithm). The same may happen, to a lesser degree, if the cache use burst mode access to main memory. On the other hand, local variables need not be saved when leaving a procedure, and the register saving strategy will know that. The local variables will still be kept in the cache, so they will be saved unnecessarily when the cache locations are re-used. Note that this does not happen all the time: if a return is followed by another call, the same memory locations are re-used, so no cache miss occur. It is possible to make a stack-cache that will know that values above the stack top are garbage, and set the do-not-save bit. This begins to look very much like variable sized register windows, the difference being that you can address arbitrarily far down the stack in a transparent fashion. As for 3), register based architectures require register allocation to map local variables to a possibly smaller number of registers. While good algorithms exist, they are all compile-time based and must thus take worst-case behaviour into account. This will invariably lead to saving registers when this is (in a particular run-time situation) not necessary. While the problem in most cases is small, it can in some cases have a noticeable effect. This is especially true in languages like LISP or Prolog, where basic blocks and procedures are small. The above discussion has centered around loading and saving of local variables, but there are other points to consider. In languages that use a lot of heap access (LISP, Prolog,...), a multiported cache is a huge benefit, whereas a large set of registers almost no help at all. Even if the cache uses burst mode access, it can benefit a heap. In fact the article "Cache Performance of Combinator Graph Reduction" by Koopman, Lee and Siewiorek in this years IEEE International Conference on Computer Languages argues that a large cache block size is beneficial for such languages. All in all, IMHO a well-designed registerless architecture with a multiported cache can perform just as well as a register architecture on C-like languages and far better on languages like LISP and Prolog. Torben Mogensen (torbenm@diku.dk)