Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!usc!samsung!spool2.mu.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg From: pcg@cs.aber.ac.uk (Piercarlo Grandi) Newsgroups: comp.arch Subject: Re: Register Count Summary: The Truth about "optimizers", for the millionth time. More interestingly, a treatise on value dataflow patterns, and how they map on common architecture classes. Message-ID: Date: 10 Jan 91 16:23:01 GMT References: <11538@pt.cs.cmu.edu> Sender: cho@aber-cs.UUCP Organization: Coleg Prifysgol Cymru Lines: 296 Nntp-Posting-Host: odin In-reply-to: lindsay@gandalf.cs.cmu.edu's message of 9 Jan 91 05:45:14 GMT On 9 Jan 91 05:45:14 GMT, lindsay@gandalf.cs.cmu.edu (Donald Lindsay) said: lindsay> In article lindsay> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: pcg> Statistics exist that show that almost any expression and a pcg> majority of code sequences can, on a single threaded CPU (one with pcg> one ALU), be compiled without spills with FOUR spare registers[1]. pcg> To add abundant inter-expression caching we need another four. lindsay> This directly contradicts recent experience with optimizing lindsay> compilers. This is directly supported by all experience with optimizing compilers. It is an old controversy, but let me repeat here: 90% of what passes for "optimizing" compilers are compilers that optimize for *space*, not time, by caching into registers things that are frequently referenced in the program text, by minimizing spills, not those used in the program execution. Since on most machines, especially RISC machines, referring to an operand in main memory takes much more space than if it were in a register, these compilers stash into registers eveything in sight, and thus they do achieve code space compaction. Since most of these compilers make no attempt to estimate the dynamic frequency of use of values, they achieve good un time performance only if *all* values are put in registers; as far as time is concerned, whether most values are in a register or a spilled to memory is irrelevant. Fortunately since it is extremely rare[1] that there are more than a couple dozen values in any one section of code, that all values are cached can usually be ensured by having a 32 strong register bank. When you run a real _time_ optimizing compiler[2] the knee of the curve, for single threaded non float computations, lies at four spare registers and four global[3] registers. lindsay> You appear to have counted floating point values (and assumed lindsay> no unrolling or software pipelining) and taken that as the lindsay> contents of the register bank. I have said *nonfloat* and *single threaded*. In my view of the world, each functional units needs its own set of registers. Most of the currently available implementations are nearly single threaded, or in any case need only few registers[3]. Loop unrolling only wins if you have some degree of multiple functional units, otherwise you only buy reduced overhead, which is not much, especially if the implementation has some degree of pipelining. lindsay> Aside from the generated-in address computations, there are lindsay> uses such as scope uplinks, subroutine parameter lists, loop lindsay> control, and the like. And those are just the conventional lindsay> uses, not the agressive ones. I am getting tired repeating that that quantitative analysis of computer architecture, which existed well before Hennessy & Patterson's book, is about knees in the two curves, resource/benef it and resource/cost[4]. My current understanding, based on vast amounts of reading[5] and analysis, is that, for general purpose applications, the number of registers is an increasing function of the number of functional units, that the knee of the curves is at four spare registers (plus possibly four global ones) for each one, and that the knee of the curve in functional units is at four as well, as repeated here: pcg> My favourite solution would be to have multiple stacks. How many? FOUR pcg> is the answer, because it is exceedingly rare that an expression pcg> contains as many independent threads of computation. A superscalar pcg> machine may attach independent ALUs to each stack, or even specialize pcg> them[4]. pcg> [4] Indeed in practice superscalar implementations find it exceedingly pcg> hard to find degrees of microparallelism greater than two or three, I have had comments here and in the past that demonstrate a fundamental difference in point of view between me and some other people. I base my numbers on the computation patterns of general purpose applications -- let's say on the data flows of values in them. Characterizing general purpose applications is far from a static field, but by now we know quite a lot. How to map these onto an application is the architect's problem. Some people hold that the machine should be organized in the most dynamic fashion to mimic the dataflow of values in any application, and come up with superscalar dataflow machines. Based on my armchair evidence I believe that most general purpose applications data flows can be modelled best with four multiple stacks (quite shallow, by the way, say four deep), because such dataflows have the shape of a tree with a relatively low branching factor, and relatively small number of levels in each independent branch. If this is true, adding more functional units is not going to be of benefit, because interference between the dataflows will be high enouggh to cause too much trouble, and having deeper stack windows is not going to be terribly useful either, because the stack depth will only rarely cause its overflow, and that will happen infrequently. lindsay> That's two or three things _per-clock_, on a _pipelined_ lindsay> machine. You are planning to operate more than one stack per lindsay> clock? In a pipelined fashion, so that each stack can be lindsay> re-referenced while an outstanding computation hasn't yet lindsay> written back its result? Why not? Once you have multiple stacks, if the compiler is not demented, and the application does have a sufficient intrinsic degree of microparallelism, dataflows on the different stacks will be typically independent. For example "(a+b) * (c+d)" can easily be mapped to two stacks to be run in parallel. Independent branches of the dataflow tree/graph are prime candidates. Just look at programs and data as though they were written in Lisp :->. lindsay> How are the multiple consequences resolved? These are problems that all superscalar architectures must solve. Architectures with a flat register file model, I believe, end up simulating multiple stacks in the flat register file, whether they are superscalar or not, because that is how application data flows are. Now, _if_ we accept my belief (but I am ready to be converted!) that general purpose applications dataflows can be characterized as above (a few, short, independent "nests" of activity), we have the following alternatives: Scalar, register file, load/store Multiple stacklets are simulated in the register file. Switching the focus among the stacklets is instantaneous, because action can happen between any two registers, and any register can be "stack top". Processing of each thread, that is of each stacklet, is serial. It is important that the register file be able to cache entire stacklets, because referring to operands in memory is so awkward. Poor instruction density, because it does not exploit exploit expected dataflow patterns, and is designed for the worst case[7]. Scalar, register file, traditional Multiple stacklets are simulated on a memory stack, the stacletk tops (and as many further down as will fit) are kept in CPU registers, the rest (the "spills") are multiplexed on the memory stack. Switching the focus among the stacklets is nearly instantaneous, because most of the action happens anyhow near the stack tops, which are in registers. Accessing the bottom elements each stacklet is not too expensive, in a register/memory architecture. Processing of each thread is serial. Fair instruction density, because it is fairly close to the multiple nested value dataflow patterns. Scalar, single stack Multiple stacklets are nested on a single CPU stack. Switching the focus among the stacks requires pushing/popping entire substacks because action can happens only at the stack top. Independent threads of action cause problems because the real stack top must be multiplexed unpleasantly. Good instruction density, because it exploits very well the nested nature of most computation flows[8]. Unfortunately encoding density is not perfect, because to switch the focus from one stacklet to another takes a number of push and pop operations. Fortunately many non numeric applications do not have more than one thread of computation. Scalar, multiple stack There is a small number of CPU stacks, of limited depth, with overflow/underflow spilling/refilling, either hardware or software assisted. Each stacklet is mapped to a different hardware stack, and carries an independent thread of the computation. Typically only two CPU stacks will be in frequent use[9], even if four will usually be provided. Four CPU stacks will enable mapping each stacklet for most applications. A depth of four for each CPU stack will result in no or rare overflows most of the time. Switching the focus between stacklets is instantaneous. Encoding density is excellent, because it maps exactly on the expected value dataflow patterns. Superscalar, registers, load/store There are multiple focus points, because each stacklet contained in the register file can be served by a different functional unit. If[10] the register file is multiported, access to the independent stacklets is fast, but a lot of interconnect is required, because the stacklets may be spread all around and intermingle arbitrarily in the register file, again as there is too much generality and worst case assumptions. If there are more functional units than there is space in the register file for as many stacklets, trouble ensues, unless exoterica are used like register renaming[11]. Superscalar, registers, traditional Superscalar, single stack They can be made to work, at the price of converting on the fly the instruction stream to that for another type of architecture[12]; the difficulty is that the presence of multiple focuses because of multiple functional units makes for trouble, as both have to multiplex multiple stacklets onto a single stack. Since "registers, traditional" only uses the memory stack for multiplexing the bottoms of the stacklets, it can be adapted to superscalar operations better than "single stack"[13]. Superscalar, multiple stack There are multiple CPU stacks, and each computation stacklet is mapped onto it. Each CPU stack need not be multiported; it can even be dedicated to a single functional unit, even if I wouldn't. Only the stack tops can be transferred to/from other stacks. Computations in each CPU stack can proceed independently, because this is how application data flows are structured. The push and pop operations could have two forms, a short one for addressing directly the on chip cache (which would become a statically managed cache, something similar to a register file, but not quite), or the cache could be a more conventional one, and push and pop have only full address forms. More functional units that there are hardware stacks is a bad idea[11]. Superscalar, dataflow In this organization the CPU is structured internally as a network of functional units, and values (and operators) flow in this network. Unless the functional unit mix required by the application is very different by that provided in the network, this will dynamically adapt to any value dataflow patterns exhibited by the application, as well as to any number and type of functional units, without recompilation. Unfortunately this requires fairly hefty dynamic scheduling in hardware, and all this generality and flexibility is rarely needed[14], and this makes dataflow machines useful only for a few very wild application. Well, I hope you have followed me down to here. I have been trapped into spilling the beans on what I think is a correct model of computational activity and its implication on the structure of an architecture (register based, either load/store or traditional, and stack based, either single or multiple), and its implementation (scalar or superscalar). Hope it has been useful to you as well, especially in debunking certain myths about optimizers, registers, superscalars, ... ================ [1] But I received by e-mail a report that in some cases there are more than 128! [2] One that uses profiling, or even better one that believes 'register' or uses heuristics -- I think Hank Dietz's group was into this sort of things -- to estimate the dynamic frequency of use of values. [3] Spare means mostly for computing expressions, global for aching values across expressions, not even procedures. Interprocedural caching is of little benefit except in particular cases. [4] For example, I have received some interesting mail that shows that in order to keep the pipeline and functional units of a MIPS R3000/R3010 CPU going full tilt, you only need six floating point registers. [5] We should not be interested in how many registers we can make use of, but in how many registers we can productive use of, and what this is function of, and how much does this cost. [6] But see David Wall's classic paper about global optimization to see that in his extreme examples the knee of the curve is at 8 registers if dynamic frequency is taken into account, versus three times as many if only static frequency is used. [7] Any value can interact with any other value, but instead nesting is so prevalent in actual value dataflow patterns. [8] Note that a pure stack architecture is load/store too: only push and pop address memory. It does away with registers, because the operands are always implicit, so all instructions except push and pop have only the opcode (a "syllable", in Burroughs terminology, or a "word" in Forth). [9] one for addessing calculations, the other for nonfloat calculation; a third CPU stack will be in use if there are float calculations. [10] Well, we could remove the "if" and substitute it with an "as" here! [11] But then if the application value data flow patterns show more than four independently exploitable nests of computation, it is no longer a general purpose application as commonly intended. Use a Cray, a DAP, a Teradata, a Connection Machine or any other SIMD/MIMD thingie. [12] I seem to remember that this is how modern 370 architectures are implemented; the 370 interior decor is emulated by a more or less loose collection of functional units connected by dataflow paths. IBM have always seen the 370 (or other) interior decor as little more than a declarative language, to be interpreted on the fly. In the S/38 the interior decor is actually translated, more or less statically, to different microcodes on different models. [13] However for both it is probably better to go the multiple CPU, rather than the multiple functional unit, route. Note that, intriguingly, for most workloads, four is also the knee of the curve for macroparallelism. Maybe R. A. Wilson's Law of Fives should be a Law of Fours :-). [14] Because general purpose application dataflow patterns (a mantra by now!) have fairly predictable shapes, as above, and full dataflow machines do not exploit this advance knowledge. Not that I think of it, they could, though, probably, by special casing. Could be interesting. -- Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk