Xref: utzoo comp.arch:20142 comp.lang.misc:6470 Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!usc!samsung!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg From: pcg@cs.aber.ac.uk (Piercarlo Grandi) Newsgroups: comp.arch,comp.lang.misc Subject: Re: Register Count Summary: Blowing my chances to get a software patent ;-) ... Message-ID: Date: 10 Jan 91 19:44:01 GMT References: <11538@pt.cs.cmu.edu> Sender: cho@aber-cs.UUCP Organization: Coleg Prifysgol Cymru Lines: 207 Nntp-Posting-Host: odin In-reply-to: pcg@cs.aber.ac.uk's message of 10 Jan 91 16:23:01 GMT On 10 Jan 91 16:23:01 GMT, pcg@cs.aber.ac.uk (myself) said: [ ... usual verbose treatise with possibly useful & interesting but not at all new things on value data flow patterns and implications for architectures and their implementations ... ] Rereading this article I have realized that there is a problem in the regular rerunning of the 'register count' and 'aggressive optimization' debates, that I know something that, as far I know up to a couple years ago, nobody else knows, and this gives me, I suppose, some insight that permeates the tone of my arguments without ever being made explicit. I have an unpublished[1] paper on how to implement multiple parallel assignment in the general case, which is something that apparently is still not generally known or published. The multiple assignment is, I have come to see, a very important thing, with far reaching consequences in at least two fields; I will discuss those for architecture[2]. I will first present the problem, then a sketch of the solution, and then the implications for system architecture. Multiple parallel assignment can be written as something like L1, L2, L3 := R1, R2, R3 Where the L's are lvalue producing expressions, and the R's are value producing expressions. The previous values of all the L's are replaced simultaneously and in parallel (at least conceptually) with the correcponding R's. Its formalization is the trivial extension of the single assingment one. Examples: x, y := w, z; CO x := w; y := z; CO a, b, c := b, c, d; CO slide a row of values CO a, b, := b, a; CO swap the values of a and b CO a, b, c := b, c, a; CO rotate a row of values CO a[i], i := i, a[i]; CO quite a bit more difficult CO As you can see there are four levels of difficulty; no dataflow dependecies, L-to-R dataflow dependencies, L-to-L dataflow dependencies, both. I have devised an obvious but complicated algorithm for expanding an _arbitrary_ multiple assignment into an equivalent sequence (or multipel sequences) of single assignments, with the use of the *minimum* number of temporaries[3] necessary, or any other quantity desired, depending on the desired degree of parallelism. So that nobody asks me for the draft of the paper, here is an outline of the algorithm: build two boolean matrixes, LR for L-to-R(-to-L) dependencies, and LL for L-to-L dependencies, an element is true if there is a dependency[4] betwen the two expressions. In other words, if N is the arity of the multiple assignment: for i,j in 1..N,1..N LL[i,j] := (i=j | 0 | Li Lj) LR[i,j] := (i=j | 0 | Li Rj) You now have got the characteristic matrixes of two directed graphs. Apply Dijkstra's cycle finding algorithm[5] to the matrixes, repeatedly, by doing the row sums and emitting one of the assignments that have zero row sum. If there aren't any, there is a cycle. If there are several row sums with the same value, select one at random (they are all part of cycles of the same degree), introduce a temporary, emit the relative assignment, and restart. Whenever an assignment to Li is emitted, delete the i'th row and column of LL and LR. That's all, in outline[6]. It is easy to prove that this generates the assignments in the right order, and introduces the minim number of temporaries. Note that the order in which single assignments are emitted if more than one row has zero row sum, or in which cycles are resolved if they have the same row sum, is a matter for scheduling policy. Also, some single assignments that have higher row sums than others can be emitted before those if temporaries are introduced to reduce those row sums[7]. The temporaries can be seen as staging places that uncouple two strands of microparallelism that would otherwise get entangled. Now the implications for system architecture, that is microparallelism. I can claim that a very large percentage of existing codes can be recoded with multiple assignments with large benefit in clarity. Anybody can think of examples[7]. I also claim that in all these important cases programmers are forced to hand expand the underlying myltiple assignment into a series of single assignments, and that a large, or even all, of the dataflow analysis of optimizers is about reversing the process and rebuilding the underlying multiple assignment, and possibly reexpanding it in a different form. If you look at the world like this, you find that instruction scheduling, the maxmimum degree of microparallelism, caching in registers across statements, common subexpression eleimination, and many other issues are just a consequence of expanding multiple parallel assignment into signle assignment, to be executed either serially or in parallel. It is obvious, at least to me, that a lot (if not all!) of so called aggressive optimization and clever register using, as opposed to competent code generation, is the regrettable result of multiple assignment not being in the language and being haphazardly reinvented every time by aggressive optimizers, and other funnily misapplied things like graph coloring. With the double hazard that the programmer is writing a very simple and easily understood multiple assignment as its *implementation* as a sequence of single assignments, and the compiler is trying to reverse engineer that. It is also obvious to me, as I have seen the trivially easy but complicated "proof" that the above described algorithm works, that this implies that in many many cases both the programmer and the optimizer must "prove" that the sequence of single assignments is indeed equivalent to the underlying multiple assignment, which is tedious and complicated, when it could be done in a reliable way as part of code generation if languages were designed with multiple assignment built in[9] and with this algorithm in a "tunable" version as to the scheduling of the single assignments, and to the number of temporaries to use and when. Especially on multiple stack superscalar machines[10], whether the stacks are uselessly implemented as portions of a flat register file or not. I had to get this off my chest, sooner or later. Most probably others have founbd the same algorithm, but if so I have never seen any indication that they attributed the same importance to it as I did when I considered all the implications[11]. If you think that they are less important than I think, reconsider; for example, read Suzuky's paper, or try to rewrite your favourite algorithms and inner loops using multiple assignments, and wonder how the algorithm could expand them on different architectures. ================ [1] Submitted for publication almost five years ago to the CJ, accepted with revisions, revisions never done :-(. I have not been in the right environment for doing or even polishing research. [2] The other one is program formalization and correctness, as in stateless or stateful languages. [3] The general form of multiple assignment can always be implemented by calculating each or the L's and each of the R's into a temporary, and then performing the single assignments between the temporaries in any order. This is what binding based, functional languages do. [4] The usual wise guys will remark that this is undecidable. In practice any weaker predicate will do, and will generate, for the simple expressions likely to be found in practice, good results. [5] The referee for my paper pointed out that I was really using topological sorting, thus missing the point entirely. Topological sorting and Dijkstra's cycle finding algorithm are equivalent (they both essentially deal with the transitive closure of a partial order operator), modulo the representations of the graph, but one has as goal finding the cycles, the other fails when there are any. Very different flavour. [6] There are additional obvious complications in selecting the optimal cycle reduction order if there are cycles in both the LR and LL matrixes, and other scheduling details, and handling cases like a, a := b, b CO well defined meaning CO a, a := b, c CO nondeterministic or illegal? CO I did a sketchy, minimal and dumb, implementation for Johnson's PCC, many years ago, in about one day. I think that adding it to any compiler would take about as much time. I have lost PCC's patches, and have not had time to add it to GNU cc or other similar technology. Sorry. [7] For example, consider "a, b := b, c". The LL matrix is all 0, and the LR matrix is 0 0 1 0 We can immediately write "L0 := R0", and then the matrix collapses, and we can write "L1 := R1", so we get the sequence a := b; b := c; If we do not like this scheduling, we can extend the matrix with the row and column for a temporary (note that a temporary never influences any L or R expression), and thus we can generate instead either of the sequences T1 := b; a := T1; b := c; CO quite silly really CO T1 := b; b := c; a := T1; CO scheduling inversion CO [8] Two particular forms of multiple assignment, sliding and rotation, are illustrated in an article in CACM, middle seventies, by Suzuki. [9] Please, not as in functional languages or languages with futures and other things that imply that you always use the maximum number of temporaries every time, as in function bindings passed by value. [10] If you think of the most applications rewritten with multiple assignments, and their dependency matrixes, it is easy to see that these multiple assignments would capture essentially all interesting microparallelism, and that they would not still be terribly complicated; rarely they would be more than four way and more than four deep each way. Even when grouped together in multiple assignments, the average expressions would still be pretty simple. Except for things like operations on whole vector/matrix (or other simple and regular structure), but then SIMD/MIMD is the right answer, not microparallelism. Massive, massive parallelism, which has its own, quite different, set of problems. [11] I stumbled upon it; I just wanted to add it to a language I was doing an implementation for, and was irked that all the references I could find at the time (for examples Gries, still in his recent books) said that the general problem, with arrays and both LL and LR dependencies, was intractable. On examining the algorithm when it was done I started to realize its implications for program writing and code generation. -- Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk