Path: utzoo!attcan!uunet!mcsun!ukc!dcl-cs!aber-cs!rupert!pcg From: pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) Newsgroups: comp.lang.misc Subject: Re: Common subexpression optimization Summary: properly designed languages are rare, unfortunately. Message-ID: Date: 1 Feb 90 16:48:20 GMT References: <4561@scolex.sco.COM> <14214@lambda.UUCP> <2217@sunset.MATH.UCLA.EDU> Sender: pcg@aber-cs.UUCP Organization: Coleg Prifysgol Cymru Lines: 238 In-reply-to: pmontgom@sonia.math.ucla.edu's message of 30 Jan 90 06:11:17 GMT In article <2217@sunset.MATH.UCLA.EDU> pmontgom@sonia.math.ucla.edu (Peter Montgomery) writes: In article pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes: >The same is true for common subexpression elimination; if there >are common subexpressions, and it is *important* that they be >shared, the programmer has better make those explicit, because >the common subexpression usually has an interesting reason for being >common. In my programs, common subexpressions often appear as a result of macro expansions. Others are not visible at the source level, but arise when producing the object code. Of course. But this is the result, most often, of insufficient language design, both as to the need to use macro expansion, and as to the hidden code problem (as I try to demonstrate later on). Let me give one, but quite general example; common subexpressions in fortran and other lanuages often arise from repeated subscript calculations. Well, this most often happens because there are no have slices, or, even better, array iterators. Consider the usual fragment for matrix product (dimensions are [m,o] [o,n]), written in some pseudo language: for i in 1..m for j in 1..n c[i,j] = 0 for k in 1..o c[i,j] = c[i,j] + a[i,k]*b[k,j] Of course there are ample opportunities for common subexpression elimination here, and other hairy optimizations. What about instead: for i in 1..m let s[] alias c[i,], fast ronly h[] alias a[i,] for j in 1..n let fast e alias s[j] initially 0, fast ronly v[] alias b[,j] for fast k in 1..o e += h[k]*v[k] (which can be readily written in Algol 68, which was *carefully* designed for easy, efficient code generation) which is vastly more descriptive as well, as it makes obvious we are using a vector product on one hand and that we *know* that e,h,v,k are the variables to cache, the others are irrelevant, and that the compiler need not bother "optimize" ? In a previous discussion on a similar theme, we compared C code for this, written both ways, the first subjected to a very highly optimizing compiler. Well, the second version was faster by two or three times. Then the following version for s[] over c[*,], ronly h[] over a[*,] for p over s[*], ronly v[] over b[,*] p = h innerprod v (ronly h[]) innerprod (ronly v[]) is let fast d initially 0 for fast ronly h_i over h[*], fast ronly v_j over v[*] d += h_i*v_j; return d would be in many ways less procedural and highly informative, and allow for a lot of safe optimization (coupled with some heuristic on expected access frequencies), especially on vector machines. Somebody may claim that the style I advocate is "unnatural", because the first version is more similar to the habits of a mathematically trained person. Let me argue that: 1) As to this particular example, I think that matrix product as usually described uses poor notation. I am all in favour of using more abstract, terse notation in mathematics. There are unfortunately many people that love lots of hanging indexes... 2) Programming is not the same as mathematics. You often have to change *radically* technique when writing a program. Risch-Norman for symbolic integration is totally unsuitable for human execution, and has no resemblance to "by parts", etc... Programming languages of the sort used for time critical applications are not "non procedural" yet, and I think never will be, and *the programmer* must adapt to them, not viceversa. 3) If thehot spot is localized, and is *important*, one doesn't give a lot about naturalness, even if I maintain that often one can still do a clean job. Now back to your example. Your code contains: MUL2(a1, b1, c1, d1) MUL2(a1, b2, c1, d2) MUL2(a2, b1, c2, d1) MUL2(a2, b2, c2, d2) . and you comment that On the SPARC, the macro is being defined not to produce the first expansion but rather is defined so MUL2(a, b, c, d) becomes (TRI(a+b) + TRI(c+d)) - (TRI(a) + TRI(c)) - (TRI(b) + TRI(d)), where TRI is a table of triangular numbers (i.e., TRI(n) = n*(n+1)/2). Ahhhhhh.... So. You want to avoid slow SPARC multiplications and you use a table. This is highly unobvious; you add that It is easily to verify that this is functionally equivalent to the first expansion, if the TRI table is sufficiently large and if no integer overflow occurs. which means that this highly machine dependent (because your space vs. time *tradeoff* is such) code must also be supported by easy but non obvious boundary conditions and mathematical reasoning. In other words, *major* wizardry, which should be shouted about, not hidden within different MUL2 expansions. After the macro is expanded, the compiler sees the four expressions (TRI(a1+b1)+TRI(c1+d1)) - (TRI(a1)+TRI(c1)) - (TRI(b1)+TRI(d1)) (TRI(a1+b2)+TRI(c1+d2)) - (TRI(a1)+TRI(c1)) - (TRI(b2)+TRI(d2)) (TRI(a2+b1)+TRI(c2+d1)) - (TRI(a2)+TRI(c2)) - (TRI(b1)+TRI(d1)) (TRI(a2+b2)+TRI(c2+d2)) - (TRI(a2)+TRI(c2)) - (TRI(b2)+TRI(d2)). As previously mentioned, this is a time-critical loop, and the macro definition has been carefully written so the subexpressions (TRI(a1)+TRI(c1)), (TRI(b1)+TRI(d1)), (TRI(a2)+TRI(c2)), (TRI(b2)+TRI(d2)) appear twice in the resultant expansions. The module which has the four references to MUL2 does not know about this; it would defeat information hiding if I expanded everything there and removed the common subexpressions explicitly. I think that a different choice of abstraction boundaries would do it; probably it is not MUL2 the abstractions layer you should care about, but instead the four successive calls. Here you are playing into my hands; in your application, what is central is indeed the effect achieved by the block of four calls, not that it is achieved by each multiplication step. In practice, a1, a2, c1, and c2 are loop invariants (but b1, b2, d1, and d2 vary on each iteration). But this is very interesting information again, which should be duly obvious to the reader of your code. Who reads your code, be it human or program, should not discoer this important property by him/her/itself. [... because of expensive array indexing on SPARC maybe ...] I want the compiler to optimize the array indexing, perhaps by computing the addresses TRI, TRI+4*a1, TRI+4*a2, TRI+4*c1, TRI+4*c2 outside the loop, with the inner loop using 4*b1, 4*b2, 4*d1, and 4*d2 as offsets. This requires the compiler to rewrite the address TRI+4*(a1+b1) as (TRI+4*a1)+(4*b1), recognizing that TRI+4*a1 is a loop invariant and that 4*b1 is also part of the address TRI+4*b1. I cannot make these optimizations at the source level. In C you can, for example. Step a pointer thru the arrays. This would be a way to make obvious that you are building your result by sequential scan of the tables. Again, e.g. for virtual memory, if the application is really critical, this might be a machine dependent valuable piece of information, or not. As another example, on many cached machines it is *vital* to do memory to memory copies that are aligned on cache line boundaries in the destination. This is something that is highly interesting, as it makes clear some unobvious design choices. Even on systems using the first macro expansion (in terms of multiplications), there may be hidden common subexpressions. For example, suppose a1*b1 + c1*d1 is done by floating point hardware, with all variables being converted from integer to floating beforehand. If this is important, I think you could well be doing it manually, just like the avoidance of multiplications above. It would also make things clearer to who reads your code, and the hazards involved (floating point has quite different properties from integer arithmetic). This conversion to floating point does not appear in my source code, but I would want the compiler to recognize that variables appearing repeatedly in products need be converted to floating only once, and that a1, a2, c2, c2 can be converted outside the loop. This will be most easily done if you let the readers of your program know that those are loop invariants. The it just takes a good code generator to choose the most efficient strategy for carrying out your wishes. Of course, if the application is really time critical, you also want to second guess your compiler and architecture. Unavoidable. In summary, I think your example is helping my argument; you take great pains to do manual micro optimization (eschewing multiplies, for example), but then you do this at what I deem too low a level, and you don't include important information as to whether some variables are invariant, and as to whether intermediate passage to floating point is safe. Well, it's mostly a difference of programming style. You may not like mine, but I think it can still cope with problems like this. On the other hand, I cannot make these optimizations at the source level. which is a limit of current languages for which compilers that are too clever by half (many of the optimizations you want your compiler to do above require non obvious reasoning -- all too many optimizers generate optimized code that is not equivalent to unoptimized code). As such, we are all poor victims of the status quo. Eventually, my conclusion is that if one wants smart compilers, one had better have languages that make that easier, less procedural and more descriptive. For languages like C or Fortran, for which the compiler must do a painful and often incorrect deductive work if second guessing the programmer is desired, I hold that careful manual optimization of the few relevant lines is both preferable and more effective, and safer, and the compiler should help with that (that's why I think that "register" in C is such a big win). On the other hand both are expensive in terms of man power, so we see monstrosities such as the many hundred millions of dollars that have been sunk in vectorizing compilers for fortran that try to second guess programmers that wrote for long dead architectures. The thought of using more appropriate language technology seems to be heretical. Well, dusty decks have a morbid fascination. Many could not do without their bugs (and don't remember what Knuth wisely observed on this matter in his volume 1 about rewriting being better -- of course if you have the manpower to do it). -- Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcvax!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk