Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!samsung!transfer!lectroid!jjmhome!smds!rh From: rh@smds.UUCP (Richard Harter) Newsgroups: comp.lang.c Subject: Some optimization results Message-ID: <444@smds.UUCP> Date: 4 May 91 08:55:35 GMT Organization: SMDS Inc., Concord, MA Lines: 99 This is a case example of code optimization. The context is that a 1024 byte page was being written to a binary file. For reasons having to do with infelicities in some NFS implementations the decision was made to read the page back in and compare what was written with what was read. [I am not defending or propounding this as the best approach; expert opinions are solicited] In any case, given that this was the functionality of the page write routine, we wanted to improve its performance. After setting up a test suite and a profiled version (gprof) the timings were self time 1.30 ms/call total 2.58 ms/call [The total includes one read, one write, and two lseeks]. How much time was spent in the compare loop? Comment it out and we get self time .02 ms/call total 1.34 ms/call Interesting. A one line compare loop seems to be inordinately expensive. What's going on here? A quick check on the code shows that the readback page is declared as an automatic array of 1024 bytes on the stack. I read on the net that this is a bad thing to do -- it plays hob with the optimization. So we make it static and get self time .45 ms/call total 1.73 ms/call Very interesting indeed. We get a factor of three just by not putting an array on the stack. There's one to remember. At this point our work is mostly done -- a reduction from 1.73 to 1.34 (best possible improvement) would only be another 20%. However we might as well go for it. Let's look at the the loop code which looks like register char *p, *q; register int i; ... for (i=1024;--i>=0;) if (*p++ != *q++) error_handler(); [Not an exact transcription.] We already have register variables and a countdown loop; nothing to be gained there. The next obvious thing to try is loop unrolling. This knocks us down to .35 ms. Nice but can we do better? Yes -- everything is aligned, so we can do integer compares. This pushes it down to .30 ms/call. Can we do better? Yes indeedy. Since this loop never fails (unless bad things have happened) we don't need to execute conditional code; all we need to do is to log the result of failure. Our first try is something like this: register int *ip, *iq, i, check; ... check = 0; for (i=32;--i>=0;) { check += (*ip++ != *iq++); ... seven more replications ... } if (check > 0) error_handler(); This seems clever, but it doesn't buy us anything. Why not? Upon reflection it turns out that the expression, (*ip++ != *iq++), is being forced to 1 or 0, so we haven't really gotten rid of the conditional code. Aha. We don't need numbers, we need 0 and non 0. We can get 0 and non 0 out of the two things being compared with an xor (produces non 0 on unequal) and accumulate into check with an bit-wise or (any non 0 compare forces check to be non 0 thereafter.) So the replicated loop line becomes check |= (*ip++ ^ *iq++); And our final timings are: Final Original self/time .20 ms/call 1.30 ms/call total 1.46 ms/call 2.58 ms/call What have we learned? First of all we learned that careful optimization of plausible looking code can reduce execution time of the code in question by a big factor, e.g. a factor of 6.5 in this case. Secondly the effect on the total time is not nearly as signifigant (cost cut by 40%.) Thirdly we have guidelines on code optimization: (a) Make arrays static or malloced; don't make them automatic. (b) Use loop unrolling and count down loops. (c) Avoid using conditional code within tight loops (d) It's often better to run a loop to completion and test for failure afterwards than it is to test with the loop. (e) Bitwise operators can make a signifigant difference Just as important there are limits to the value of local code optimization. We could gain a factor of two just by removing the readback test (safety issue.) More importantly the most signifigant improvements can only be made by changing the underlying program so that fewer writes have to be made. Note: The test machine was a SUN 4/110. The code presented has comments removed and does not use the actual layout and naming conventions style. -- Richard Harter, Software Maintenance and Development Systems, Inc. Net address: jjmhome!smds!rh Phone: 508-369-7398 US Mail: SMDS Inc., PO Box 555, Concord MA 01742 This sentence no verb. This sentence short. This signature done.