Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg From: pcg@cs.aber.ac.uk (Piercarlo Grandi) Newsgroups: comp.arch Subject: Re: Register Count Message-ID: Date: 13 Jan 91 17:40:42 GMT References: <11566@pt.cs.cmu.edu> Sender: cho@aber-cs.UUCP Organization: Coleg Prifysgol Cymru Lines: 181 Nntp-Posting-Host: odin In-reply-to: lindsay@gandalf.cs.cmu.edu's message of 12 Jan 91 20:59:26 GMT On 12 Jan 91 20:59:26 GMT, lindsay@gandalf.cs.cmu.edu (Donald Lindsay) said: lindsay> In article lindsay> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: pcg> Loop unrolling only wins if you have some degree of multiple pcg> functional units, otherwise you only buy reduced overhead, which is pcg> not much, especially if the implementation has some degree of pcg> pipelining. lindsay> Unrolling (to reduce overhead) is the basic trick of hand-coded lindsay> memcpy routines, on such "multiple functional unit" classics as lindsay> the 8080 and 68010. Pipelining makes unrolling more useful, not lindsay> less. My fault -- I was not discussing 8080s and 68010s, but contemporary machines. If there is *no* suitable pipelining, like for those, unrolling buys you reduced overhead, as I wrote. Not a lot, and you don't need a lot of unrolling to reduce overhead suitably, and since you have no pipelining you can reuse the registers in each stage of the unrolling. I remember having done some experiments with copying core on a PDP-11/70; unrolling beyond three or four was mostly pointless; with loops with a large body the advantage was even smaller. If you have *some limited* degree of pipelining, as in contemporary implementations, such as the classic three-four stage pipeline that overlaps some computation with some control, and especially if this pipeline is exposed with things like delayed branches, then unrolling buys you nothing at all in time, and loses code space. Maybe you could get up to date on the "benefits" of loop unrolling for contemporary, shallow pipelined, implementations if you got my CoreCopy() function from a comp.arch archive, and tried it on several contemporary machines with different values for its several options. Just to inspire your curiosity, I have found that for example on an implementation with a shallow pipeline like the 68020 unrolling buys you nothing at all. This is not untypical. I have been sent an example (thanks!) that shows that on a MIPS R30x0, which has a shallow implementation, one can keep the floating point pipeline busy with the use of 6-7 registers and *no* unrolling. This is not untypical either. pcg> It is an old controversy, but let me repeat here: 90% of what pcg> passes for "optimizing" compilers are compilers that optimize for pcg> *space*, not time, ... lindsay> The optimizing compilers that I worked on, were trying for lindsay> speed. We tuned the things by endlessly recompiling and lindsay> running our (ever-expanding) code-quality suite. We didn't lindsay> bother to measure how small the generated code was: we just lindsay> timed it. This seems (to me at least) a funny way of proceeding, and deomstrates a trial-and-error mentality that may be there because of lack of insight. Well, maybe that's the case. If you were aiming at one goal with technologies suited to achieving another goal, too bad for you. It can happen, even to the best; a lot of serious and important people that publish papers on "optimizing" compilers expound the funny view that things like graph coloring optimize speed and not space. I don't know who's the joke on. The joke is that graph coloring minimizes the *number* of spills, not their *cost* in time. Most spills have negligible cost in time, because they are not in critical paths, and are not executed frequently. If an "optimizer" does not attempt to minimize the *dynamic* cost of spills, by estimating the frequency of reuse of values either by feedback profiling or by heuristics, it will only achieve optimal timing by pure chance, or by having so many registers to play with that the cost of spills is zero because there aren't any (virtually all the so called "optimizers" for RISC machines). I had hoped that more people understood what David Wall and Hank Dietz can say on this subject (not to mention more esoteric technologies). pcg> Based on my armchair evidence I believe that most general purpose pcg> applications data flows can be modelled best with four multiple pcg> stacks (quite shallow, by the way, say four deep), because such pcg> dataflows have the shape of a tree with a relatively low branching pcg> factor, and relatively small number of levels in each independent pcg> branch. lindsay> Further, you are referring to floating point calculation. No, to nonfloat ones, explicitly. Yet I think that for float ones things are similar, for many, but maybe not most, applications. Several important numerical applications that (unfortunately) use floating point have FIFO dataflow reference patterns, and of course a SIMD/MIMD machine is far better suited to them than one based on multiple stacks, which is designed for (mythical :->) general purpose applications. I think that it is in order to point to the classic "The MU5 computer system", by Ibbett & Morris, MacMillan, for a clear description of how different must architecture be for different types of expected dataflow reference patterns. The MU5 was designed for FIFO, numerical codes, while it ended up running general purpose applications. Baaad vibes, man. lindsay> It is just as easy to believe that the modal stack depth would lindsay> be 1 (one), hence, the multiple stacks would be that many lindsay> (expensive) registers. Hey, you are stealing one of my lines here! That you need very few registers because most dynamically important expressions are terribly simple is *my* mantra. I introduced the idea of multiple stacks as a concession to the important, even if not the modal, case where the dataflow patterns do have some (limited) complexity, like multiple non nested nests of computation. Multiple stacks is like flat register file, except that it takes advantage of the nested nature of each of the strands to allows for separate stacks. Maybe I should repeat the argument more clearly (hopefully). Suppose you have determined that you need *up to* 16 spare registers to cache optimally your dataflow reference patterns, that exhibit typically *up to* 4 independent strands of computation, each of them involving maybe *up to* 4 levels of spill. How should you organize the 16 registers? If you choose a flat register file you need 4+4(+4) bits to encode the operand numbers and you need, for an N<=4-way superscalar implementation, an N+k multiported file. If you use four register stacks with four levels each, you need 2 bits per operand number and each separate stack can be single ported. Naturally the flat register file can cope much better with data reference patterns that are not *up to* four way independent and *up to* four way deep each, but I surmise that these are relatively rare, for general purpose applications. Naturally if you think that you can cache the locality of a code with less than 16 registers, please use less than those, or if you think that the data reference patterns exhibit no exploitable nesting, then use a flat register file with its generality and associated costs. More research needs to be done on this subject; I do have anectodal evidence (and have read some papers that seem to imply) that 4x4 is a good trade off between the number of registers and the profile of most codes and their exploitable parallelism. I could be persuaded easily that *most* codes do not need those many, because that's my own mentra, but then the same argument would apply to the size of a flat register file. It is sad that people like you spend lots of time in trial-and-error attempts to get an "optimizer" going using possibly irrelevant technology, instead of investigating important issues like the shape of dataflow reference patterns for various classes of applications. Or maybe too many products/careers have been built on large register banks and "optimizers" for this. My argument, more than being centered on the number four (which I still believe to be a reasonable estimate), was centered on the idea that a flat register file does not reflect the expected reference patterns of general purpose codes as the same number of registers arranged as multiple stacks. For certain types of codes, neither arrangements may be best -- vector and vector oriented superscalars (e.g. the ZS-1) have register *queues* instead of register stacks or flat files, as the expected dataflow patterns are FIFO, with no value reuse, but with deep pipelining. lindsay> But, on the whole, stack architectures are dead, Just as quantitative computer architecture is, if serious people can delude themselves that classic graph coloring optimizes run time and not code space, and there is a scant research on studying the actual reference patterns of classes of code, instead of building products/careers on questionable _assumptions_, that seem to work only because they are in fact never tested. lindsay> and dealing with a proposed one doesn't seem like a good use of lindsay> my time. Yes, probably reading the relevant literature is a better use of your time. Reread David Wall and company a million times... Or maybe the David Wall in my world is a pop singer in yours :-). -- Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk