Xref: utzoo comp.arch:20142 comp.lang.misc:6470
Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!usc!samsung!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch,comp.lang.misc
Subject: Re: Register Count
Summary: Blowing my chances to get a software patent ;-) ...
Message-ID: <PCG.91Jan10194401@odin.cs.aber.ac.uk>
Date: 10 Jan 91 19:44:01 GMT
References: <PCG.91Jan8175401@odin.cs.aber.ac.uk> <11538@pt.cs.cmu.edu>
	<PCG.91Jan10162301@odin.cs.aber.ac.uk>
Sender: cho@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 207
Nntp-Posting-Host: odin
In-reply-to: pcg@cs.aber.ac.uk's message of 10 Jan 91 16:23:01 GMT

On 10 Jan 91 16:23:01 GMT, pcg@cs.aber.ac.uk (myself) said:

    [ ... usual verbose treatise with possibly useful & interesting
    but not at all new things on value data flow patterns and
    implications for architectures and their implementations ... ]

Rereading this article I have realized that there is a problem in the
regular rerunning of the 'register count' and 'aggressive optimization'
debates, that I know something that, as far I know up to a couple years ago,
nobody else knows, and this gives me, I suppose, some insight that permeates
the tone of my arguments without ever being made explicit.

I have an unpublished[1] paper on how to implement multiple parallel
assignment in the general case, which is something that apparently is still
not generally known or published. The multiple assignment is, I have come to
see, a very important thing, with far reaching consequences in at least two
fields; I will discuss those for architecture[2]. I will first present the
problem, then a sketch of the solution, and then the implications for system
architecture.

Multiple parallel assignment can be written as something like

	L1, L2, L3 := R1, R2, R3

Where the L's are lvalue producing expressions, and the R's are value
producing expressions. The previous values of all the L's are replaced
simultaneously and in parallel (at least conceptually) with the
correcponding R's. Its formalization is the trivial extension of the
single assingment one.

Examples:

	x, y := w, z;		CO x := w; y := z; 		CO
	a, b, c := b, c, d;	CO slide a row of values	CO
	a, b, := b, a;		CO swap the values of a and b 	CO
	a, b, c := b, c, a;	CO rotate a row of values	CO
	a[i], i := i, a[i];	CO quite a bit more difficult 	CO

As you can see there are four levels of difficulty; no dataflow dependecies,
L-to-R dataflow dependencies, L-to-L dataflow dependencies, both.

I have devised an obvious but complicated algorithm for expanding an
_arbitrary_ multiple assignment into an equivalent sequence (or multipel
sequences) of single assignments, with the use of the *minimum* number of
temporaries[3] necessary, or any other quantity desired, depending on the
desired degree of parallelism.

So that nobody asks me for the draft of the paper, here is an outline of the
algorithm: build two boolean matrixes, LR for L-to-R(-to-L) dependencies,
and LL for L-to-L dependencies, an element is true if there is a
dependency[4] betwen the two expressions. In other words, if N is the arity
of the multiple assignment:

    for i,j in 1..N,1..N
	LL[i,j] := (i=j | 0 | Li <influences the value of> Lj)
	LR[i,j] := (i=j | 0 | Li <influences the value of> Rj)

You now have got the characteristic matrixes of two directed graphs.
Apply Dijkstra's cycle finding algorithm[5] to the matrixes, repeatedly,
by doing the row sums and emitting one of the assignments that have zero
row sum. If there aren't any, there is a cycle. If there are several row
sums with the same value, select one at random (they are all part of
cycles of the same degree), introduce a temporary, emit the relative
assignment, and restart.  Whenever an assignment to Li is emitted,
delete the i'th row and column of LL and LR. That's all, in outline[6].

It is easy to prove that this generates the assignments in the right
order, and introduces the minim number of temporaries.

Note that the order in which single assignments are emitted if more than
one row has zero row sum, or in which cycles are resolved if they have
the same row sum, is a matter for scheduling policy. Also, some single
assignments that have higher row sums than others can be emitted before
those if temporaries are introduced to reduce those row sums[7].  The
temporaries can be seen as staging places that uncouple two strands of
microparallelism that would otherwise get entangled.

Now the implications for system architecture, that is microparallelism.

I can claim that a very large percentage of existing codes can be
recoded with multiple assignments with large benefit in clarity. Anybody
can think of examples[7]. I also claim that in all these important cases
programmers are forced to hand expand the underlying myltiple assignment
into a series of single assignments, and that a large, or even all, of
the dataflow analysis of optimizers is about reversing the process and
rebuilding the underlying multiple assignment, and possibly reexpanding
it in a different form.

If you look at the world like this, you find that instruction scheduling,
the maxmimum degree of microparallelism, caching in registers across
statements, common subexpression eleimination, and many other issues are
just a consequence of expanding multiple parallel assignment into signle
assignment, to be executed either serially or in parallel.

It is obvious, at least to me, that a lot (if not all!) of so called
aggressive optimization and clever register using, as opposed to competent
code generation, is the regrettable result of multiple assignment not being
in the language and being haphazardly reinvented every time by aggressive
optimizers, and other funnily misapplied things like graph coloring. With
the double hazard that the programmer is writing a very simple and easily
understood multiple assignment as its *implementation* as a sequence of
single assignments, and the compiler is trying to reverse engineer that.

It is also obvious to me, as I have seen the trivially easy but complicated
"proof" that the above described algorithm works, that this implies that in
many many cases both the programmer and the optimizer must "prove" that the
sequence of single assignments is indeed equivalent to the underlying
multiple assignment, which is tedious and complicated, when it could be done
in a reliable way as part of code generation if languages were designed with
multiple assignment built in[9] and with this algorithm in a "tunable"
version as to the scheduling of the single assignments, and to the number of
temporaries to use and when.

Especially on multiple stack superscalar machines[10], whether the stacks are
uselessly implemented as portions of a flat register file or not.

I had to get this off my chest, sooner or later. Most probably others have
founbd the same algorithm, but if so I have never seen any indication that
they attributed the same importance to it as I did when I considered all the
implications[11]. If you think that they are less important than I think,
reconsider; for example, read Suzuky's paper, or try to rewrite your
favourite algorithms and inner loops using multiple assignments, and wonder
how the algorithm could expand them on different architectures.

================

[1] Submitted for publication almost five years ago to the CJ, accepted with
revisions, revisions never done :-(. I have not been in the right environment
for doing or even polishing research.

[2] The other one is program formalization and correctness, as in stateless
or stateful languages.

[3] The general form of multiple assignment can always be implemented by
calculating each or the L's and each of the R's into a temporary, and
then performing the single assignments between the temporaries in any
order. This is what binding based, functional languages do.

[4] The usual wise guys will remark that this is undecidable. In
practice any weaker predicate will do, and will generate, for the simple
expressions likely to be found in practice, good results.

[5] The referee for my paper pointed out that I was really using topological
sorting, thus missing the point entirely. Topological sorting and Dijkstra's
cycle finding algorithm are equivalent (they both essentially deal with the
transitive closure of a partial order operator), modulo the representations
of the graph, but one has as goal finding the cycles, the other fails when
there are any. Very different flavour.

[6] There are additional obvious complications in selecting the optimal
cycle reduction order if there are cycles in both the LR and LL
matrixes, and other scheduling details, and handling cases like

	a, a := b, b		CO well defined meaning		CO
	a, a := b, c		CO nondeterministic or illegal?	CO

I did a sketchy, minimal and dumb, implementation for Johnson's PCC,
many years ago, in about one day. I think that adding it to any compiler
would take about as much time. I have lost PCC's patches, and have not
had time to add it to GNU cc or other similar technology. Sorry.

[7] For example, consider "a, b := b, c". The LL matrix is all 0, and
the LR matrix is

	0 0
	1 0

We can immediately write "L0 := R0", and then the matrix
collapses, and we can write "L1 := R1", so we get the sequence

	a := b; b := c;

If we do not like this scheduling, we can extend the matrix with the row and
column for a temporary (note that a temporary never influences any L or R
expression), and thus we can generate instead either of the sequences

	T1 := b; a := T1; b := c;	CO quite silly really	CO
	T1 := b; b := c; a := T1;	CO scheduling inversion	CO

[8] Two particular forms of multiple assignment, sliding and rotation,
are illustrated in an article in CACM, middle seventies, by Suzuki.

[9] Please, not as in functional languages or languages with futures and
other things that imply that you always use the maximum number of
temporaries every time, as in function bindings passed by value.

[10] If you think of the most applications rewritten with multiple
assignments, and their dependency matrixes, it is easy to see that these
multiple assignments would capture essentially all interesting
microparallelism, and that they would not still be terribly complicated;
rarely they would be more than four way and more than four deep each way.
Even when grouped together in multiple assignments, the average expressions
would still be pretty simple. Except for things like operations on whole
vector/matrix (or other simple and regular structure), but then SIMD/MIMD is
the right answer, not microparallelism. Massive, massive parallelism, which
has its own, quite different, set of problems.

[11] I stumbled upon it; I just wanted to add it to a language I was doing
an implementation for, and was irked that all the references I could find at
the time (for examples Gries, still in his recent books) said that the
general problem, with arrays and both LL and LR dependencies, was
intractable. On examining the algorithm when it was done I started to
realize its implications for program writing and code generation.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk