Path: utzoo!attcan!uunet!mcsun!ukc!dcl-cs!aber-cs!rupert!pcg
From: pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.lang.misc
Subject: Re: Common subexpression optimization
Summary: properly designed languages are rare, unfortunately.
Message-ID: <PCG.90Feb1164820@rupert.cs.aber.ac.uk>
Date: 1 Feb 90 16:48:20 GMT
References: <4561@scolex.sco.COM> <14214@lambda.UUCP>
	<PCG.90Jan29131938@rupert.cs.aber.ac.uk> <2217@sunset.MATH.UCLA.EDU>
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 238
In-reply-to: pmontgom@sonia.math.ucla.edu's message of 30 Jan 90 06:11:17 GMT

In article <2217@sunset.MATH.UCLA.EDU> pmontgom@sonia.math.ucla.edu (Peter Montgomery) writes:

   In article <PCG.90Jan29131938@rupert.cs.aber.ac.uk>
   pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes:

   >The same is true for common subexpression elimination; if there
   >are common subexpressions, and it is *important* that they be
   >shared, the programmer has better make those explicit, because
   >the common subexpression usually has an interesting reason for being
   >common.

   In my programs, common subexpressions often appear
   as a result of macro expansions.  Others are not visible at the
   source level, but arise when producing the object code.

Of course. But this is the result, most often, of insufficient
language design, both as to the need to use macro expansion, and
as to the hidden code problem (as I try to demonstrate later on).

Let me give one, but quite general example; common subexpressions
in fortran and other lanuages often arise from repeated subscript
calculations. Well, this most often happens because there are no
have slices, or, even better, array iterators.

Consider the usual fragment for matrix product (dimensions are
[m,o] [o,n]), written in some pseudo language:

	for i in 1..m
	for j in 1..n
	  c[i,j] = 0
	  for k in 1..o
	    c[i,j] = c[i,j] + a[i,k]*b[k,j]

Of course there are ample opportunities for common subexpression
elimination here, and other hairy optimizations. What about instead:

	for i in 1..m
	  let s[] alias c[i,],
	      fast ronly h[] alias a[i,]
	  for j in 1..n
	    let fast e alias s[j] initially 0,
		fast ronly v[] alias b[,j]
	    for fast k in 1..o
	      e += h[k]*v[k]

(which can be readily written in Algol 68, which was *carefully*
designed for easy, efficient code generation) which is vastly
more descriptive as well, as it makes obvious we are using a
vector product on one hand and that we *know* that e,h,v,k are
the variables to cache, the others are irrelevant, and that the
compiler need not bother "optimize" ?

	In a previous discussion on a similar theme, we compared
	C code for this, written both ways, the first subjected
	to a very highly optimizing compiler. Well, the second version
	was faster by two or three times.

Then the following version

	for s[] over c[*,], ronly h[] over a[*,]
	for p over s[*], ronly v[] over b[,*]
	  p = h innerprod v

	(ronly h[]) innerprod (ronly v[]) is
	  let fast d initially 0
	  for fast ronly h_i over h[*], fast ronly v_j over v[*]
	    d += h_i*v_j;
	  return d

would be in many ways less procedural and highly informative,
and allow for a lot of safe optimization (coupled with some heuristic
on expected access frequencies), especially on vector machines.

Somebody may claim that the style I advocate is "unnatural", because
the first version is more similar to the habits of a mathematically
trained person. Let me argue that:

1) As to this particular example, I think that matrix product as
usually described uses poor notation.  I am all in favour of
using more abstract, terse notation in mathematics.  There are
unfortunately many people that love lots of hanging indexes...

2) Programming is not the same as mathematics. You often have to
change *radically* technique when writing a program. Risch-Norman
for symbolic integration is totally unsuitable for human
execution, and has no resemblance to "by parts", etc...
Programming languages of the sort used for time critical
applications are not "non procedural" yet, and I think never will
be, and *the programmer* must adapt to them, not viceversa.

3) If thehot spot is localized, and is *important*, one doesn't give
a lot about naturalness, even if I maintain that often one can still
do a clean job.

Now back to your example. Your code contains:

		    MUL2(a1, b1, c1, d1)
		    MUL2(a1, b2, c1, d2)
		    MUL2(a2, b1, c2, d1)
		    MUL2(a2, b2, c2, d2) .

and you comment that

	    On the SPARC, the macro is being defined not to produce the
    first expansion but rather is defined so MUL2(a, b, c, d)
    becomes (TRI(a+b) + TRI(c+d)) - (TRI(a) + TRI(c)) - (TRI(b) + TRI(d)),
    where TRI is a table of triangular numbers (i.e., TRI(n) = n*(n+1)/2).

Ahhhhhh.... So. You want to avoid slow SPARC multiplications and you use
a table. This is highly unobvious; you add that

    It is easily to verify that this is functionally equivalent to
    the first expansion, if the TRI table is sufficiently large
    and if no integer overflow occurs.

which means that this highly machine dependent (because your
space vs. time *tradeoff* is such) code must also be supported by
easy but non obvious boundary conditions and mathematical reasoning.

In other words, *major* wizardry, which should be shouted about, not
hidden within different MUL2 expansions.

    After the macro is expanded, the compiler sees the four expressions

	    (TRI(a1+b1)+TRI(c1+d1)) - (TRI(a1)+TRI(c1)) - (TRI(b1)+TRI(d1))
	    (TRI(a1+b2)+TRI(c1+d2)) - (TRI(a1)+TRI(c1)) - (TRI(b2)+TRI(d2))
	    (TRI(a2+b1)+TRI(c2+d1)) - (TRI(a2)+TRI(c2)) - (TRI(b1)+TRI(d1))
	    (TRI(a2+b2)+TRI(c2+d2)) - (TRI(a2)+TRI(c2)) - (TRI(b2)+TRI(d2)).

    As previously mentioned, this is a time-critical loop, and the
    macro definition has been carefully written so the subexpressions
    (TRI(a1)+TRI(c1)), (TRI(b1)+TRI(d1)), (TRI(a2)+TRI(c2)),
    (TRI(b2)+TRI(d2)) appear twice in the resultant expansions.

    The module which has the four references to MUL2 does not know
    about this; it would defeat information hiding if I expanded
    everything there and removed the common subexpressions
    explicitly.

I think that a different choice of abstraction boundaries would do it;
probably it is not MUL2 the abstractions layer you should care about,
but instead the four successive calls. Here you are playing into my hands;
in your application, what is central is indeed the effect achieved by the
block of four calls, not that it is achieved by each multiplication step.

    In practice, a1, a2, c1, and c2 are loop invariants
    (but b1, b2, d1, and d2 vary on each iteration).

But this is very interesting information again, which should be
duly obvious to the reader of your code. Who reads your code, be
it human or program, should not discoer this important property
by him/her/itself.

    [... because of expensive array indexing on SPARC maybe ...]
    I want the compiler to optimize the array indexing, perhaps by
    computing the addresses TRI, TRI+4*a1, TRI+4*a2, TRI+4*c1,
    TRI+4*c2 outside the loop, with the inner loop using 4*b1, 4*b2,
    4*d1, and 4*d2 as offsets.  This requires the compiler to rewrite
    the address TRI+4*(a1+b1) as (TRI+4*a1)+(4*b1), recognizing that
    TRI+4*a1 is a loop invariant and that 4*b1 is also part of the
    address TRI+4*b1.  I cannot make these optimizations at the
    source level.

In C you can, for example. Step a pointer thru the arrays. This
would be a way to make obvious that you are building your result
by sequential scan of the tables. Again, e.g. for virtual memory,
if the application is really critical, this might be a machine
dependent valuable piece of information, or not. As another
example, on many cached machines it is *vital* to do memory to
memory copies that are aligned on cache line boundaries in the
destination. This is something that is highly interesting, as it
makes clear some unobvious design choices.

    Even on systems using the first macro expansion (in terms of
    multiplications), there may be hidden common subexpressions.  
    For example, suppose a1*b1 + c1*d1 is done by floating point 
    hardware, with all variables being converted from integer to floating 
    beforehand.

If this is important, I think you could well be doing it manually, just
like the avoidance of multiplications above. It would also make things
clearer to who reads your code, and the hazards involved (floating
point has quite different properties from integer arithmetic).

    This conversion to floating point does not appear in my 
    source code, but I would want the compiler to recognize that variables 
    appearing repeatedly in products need be converted to floating only once, 
    and that a1, a2, c2, c2 can be converted outside the loop.

This will be most easily done if you let the readers of your program
know that those are loop invariants. The it just takes a good code generator
to choose the most efficient strategy for carrying out your wishes. Of
course, if the application is really time critical, you also want to
second guess your compiler and architecture. Unavoidable.

In summary, I think your example is helping my argument; you take great
pains to do manual micro optimization (eschewing multiplies, for example),
but then you do this at what I deem too low a level, and you don't
include important information as to whether some variables are invariant,
and as to whether intermediate passage to floating point is safe.

Well, it's mostly a difference of programming style. You may not like mine,
but I think it can still cope with problems like this. On the other hand,

	I cannot make these optimizations at the source level.

which is a limit of current languages for which compilers that
are too clever by half (many of the optimizations you want your
compiler to do above require non obvious reasoning -- all too
many optimizers generate optimized code that is not equivalent to
unoptimized code). As such, we are all poor victims of the status
quo.

Eventually, my conclusion is that if one wants smart compilers,
one had better have languages that make that easier, less
procedural and more descriptive. For languages like C or Fortran,
for which the compiler must do a painful and often incorrect
deductive work if second guessing the programmer is desired, I
hold that careful manual optimization of the few relevant lines
is both preferable and more effective, and safer, and the compiler
should help with that (that's why I think that "register" in C
is such a big win).

On the other hand both are expensive in terms of man power, so we
see monstrosities such as the many hundred millions of dollars
that have been sunk in vectorizing compilers for fortran that try
to second guess programmers that wrote for long dead
architectures.  The thought of using more appropriate language
technology seems to be heretical.

Well, dusty decks have a morbid fascination. Many could not do
without their bugs (and don't remember what Knuth wisely observed
on this matter in his volume 1 about rewriting being better -- of
course if you have the manpower to do it).
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk