Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!swrinde!cs.utexas.edu!sun-barr!rutgers!cmcl2!kramden.acf.nyu.edu!brnstnd
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Newsgroups: comp.lang.c
Subject: Re: Some optimization results
Message-ID: <8834:May1504:18:2391@kramden.acf.nyu.edu>
Date: 15 May 91 04:18:23 GMT
References: <3501@charon.cwi.nl> <14077:May1321:14:2691@kramden.acf.nyu.edu> <22723@yunexus.YorkU.CA>
Organization: IR
Lines: 77

In article <22723@yunexus.YorkU.CA> oz@yunexus.yorku.ca (Ozan Yigit) writes:
  [ an implication that hand optimization only gains a few percent ]

Sorry, Ozan, but you're wrong. The yabba code after hand optimization
was slightly over two times faster than the code before. As another
example, I've taken the ``compress'' source and optimized it by hand;
the result is 30% faster for compression and 50% faster for
decompression. That is a lot more than a few percent.

> >I can keep pointing out how hand optimization gives huge speedups.
>  wollman@emily.uvm.edu (Garrett Wollman) points out elsewhere:

Where is that? All reports of yabbawhap problems should either be sent
to me or be crossposted to comp.sources.bugs. How am I supposed to fix
them otherwise?

>  | [For example, yabba will not work for (some subset of the set of)
>  | large binary files on our SGI 4D/340S.

Dunno about this. I've received five reports of yabba failures, and in
three of the cases it was simply misconfigured in ways that checkconf
(which is part of the package) would have detected had it been run as
per instructions. I haven't received word yet on the fourth and fifth.

> Dan also spent so much effort
>  | unrolling loops that the source is nearly incomprehensible,

I have no problem understanding it. I really didn't spend much effort on
the unrolling; it was just a matter of taking a subroutine like this:

#define DOWN0(n,p,h,sp,sch,no,FOUND,NOTFOUND) { \
 if ((no) = HANO(h,sch)) { if (PARENT(p,no) != (sp)) { \
   do (no) = NEXT(n,no); while ((no) && (PARENT(p,no) != (sp))); \
   if (no) { FOUND } else { NOTFOUND } \
 } else { FOUND } } else { NOTFOUND } \
}

and adding a parallel version like this:

#define DOWN1(n,p,h,sp,sch,no,FOUND,NOTFOUND) { \
 if ((no) = HANO(h,sch)) { if (PARENT(p,no) != (sp)) { \
 if ((no) = NEXT(n,no))  { if (PARENT(p,no) != (sp)) { \
   do (no) = NEXT(n,no); while ((no) && (PARENT(p,no) != (sp))); \
   if (no) { FOUND } else { NOTFOUND } \
 } else { FOUND } } else { NOTFOUND } \
 } else { FOUND } } else { NOTFOUND } \
}

plus a DOWN macro to select between DOWN0 and DOWN1 (as well as DOWN2
and DOWN5). Since this unrolling is all isolated inside a separate
routine with a well-defined interface, it doesn't hurt maintainability.
(Admittedly, I'm the only one with a copy of that interface definition
at the moment; see the top of huptrie.h for further comments.)

The most important hand optimizations in yabba were data structure
rearrangements: -DPTRS (though -UPTRS is faster on the Convex, and other
machines where a[i] is as fast as a[0]), combining two arrays into a
single array of structs, etc. It'll be a long time before compilers can
do anything like this.

> and breaks
>  | a parallelizing preprocessor which can be used by the MIPS/SGI
>  | compiler to great effect.]

There is an important issue here: If an optimization will slow down the
code on some machine, then you should always provide a way to turn the
optimization off. In this case, -DOPTCANT1 is documented to turn off the
unrolling, and if Wollman's preprocessor is buggy then he should have
followed the instructions and compiled with that option. -DPTRS is a
similar example---I made that optimization with the Convex in mind, and
made sure that it could be easily toggled on or off. And so on.

On the bright side, note that yabbawhap has so far found at least two
optimizer bugs, one in a prerelease version. At least in the latter
case, the bug will never see the light of day.

---Dan