Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!cmcl2!kramden.acf.nyu.edu!brnstnd From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein) Newsgroups: comp.lang.misc Subject: Re: Aggressive optimization Message-ID: <29028:Oct1906:58:2890@kramden.acf.nyu.edu> Date: 19 Oct 90 06:58:28 GMT References: <13405:Oct1800:22:5690@kramden.acf.nyu.edu> <66071@lanl.gov> Organization: IR Lines: 41 In article <66071@lanl.gov> jlg@lanl.gov (Jim Giles) writes: > From article <13405:Oct1800:22:5690@kramden.acf.nyu.edu>, by brnstnd@kramden.acf.nyu.edu (Dan Bernstein): > > I do not remember ever making a mistake in hand optimization by the most > > fundamental standard technique: adding variable x to keep track of some > > intermediate quantity, and eliminating redundant variables given x. > _ALL_ production quality compilers I've ever used on mainframes can do > much better than all but the most adept programmer at this optimization. > Further, the adept programmer can usually do no better than the compiler. What?! This is absolutely unbelievable. In one of my last few articles I had a paragraph listing some of what a competent hand optimizer does regularly. A simple example, and one responsible for a 10% speedup in one section of code: A variable is incremented regularly, but never goes past 2^16 - 1. Whenever it reaches a power of 2, another variable (which is initialized appropriately) is incremented. It is the number of bits in the first variable, though I don't know any compiler that can figure this out. It is hence smaller than 16. In another piece of code, a loop (which is performed at most twice---I had to unroll it by hand, because the compiler couldn't figure it out) essentially subtracts 8 from that variable each time around, and references (1 << x) - 1 for x a certain expression computed from that variable. I knew x had to be between 0 and 7, so I made a table and replaced (1 << x) - 1 by a reference into that table. This took me hardly any thought; it was one of hundreds of hand optimizations that I performed on the same program. Those optimizations added up to a 3x overall improvement, over and above the 1.5-2x that the compiler got for being able to use the machine language directly. A less trivial example: In a heavily hand-optimized implementation of Nussbaumer's convolution method, I had to combine a 27-digit number's residues mod 10^9 + {1,3,5} into its three digits radix 10^9. A certain intermediate result was guaranteed to be between 0 and 28258, though the compiler wouldn't be able to figure this out without some rather amazing calculations with floor and modulus functions. I was able to take advantage of this to rewrite a large portion of the computation in a way that I knew was safe only because overflow was impossible. Jim, I bet you $100 that you have used a ``production quality compiler... on mainframes'' that cannot perform this optimization. Bet? Now what did you mean to say? ---Dan