Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!samsung!transfer!lectroid!jjmhome!smds!rh
From: rh@smds.UUCP (Richard Harter)
Newsgroups: comp.lang.c
Subject: Some optimization results
Message-ID: <444@smds.UUCP>
Date: 4 May 91 08:55:35 GMT
Organization: SMDS Inc., Concord, MA
Lines: 99

This is a case example of code optimization.  The context is that a 1024
byte page was being written to a binary file.  For reasons having to do
with infelicities in some NFS implementations the decision was made to
read the page back in and compare what was written with what was read.
[I am not defending or propounding this as the best approach; expert
opinions are solicited]  In any case, given that this was the functionality
of the page write routine, we wanted to improve its performance.  After
setting up a test suite and a profiled version (gprof) the timings were

self time 1.30 ms/call
total     2.58 ms/call

[The total includes one read, one write, and two lseeks].  How much time
was spent in the compare loop?  Comment it out and we get

self time  .02 ms/call
total     1.34 ms/call

Interesting.  A one line compare loop seems to be inordinately expensive.
What's going on here?  A quick check on the code shows that the readback
page is declared as an automatic array of 1024 bytes on the stack.  I
read on the net that this is a bad thing to do -- it plays hob with the
optimization.  So we make it static and get

self time  .45 ms/call
total     1.73 ms/call

Very interesting indeed.  We get a factor of three just by not putting
an array on the stack.  There's one to remember.  At this point our work
is mostly done -- a reduction from 1.73 to 1.34 (best possible improvement)
would only be another 20%.  However we might as well go for it.  Let's
look at the the loop code which looks like

	register char *p, *q;
	register int i;
	...
	for (i=1024;--i>=0;) if (*p++ != *q++) error_handler();

[Not an exact transcription.]  We already have register variables and
a countdown loop; nothing to be gained there.  The next obvious thing
to try is loop unrolling.  This knocks us down to .35 ms.  Nice but can
we do better?  Yes -- everything is aligned, so we can do integer compares.
This pushes it down to .30 ms/call.  Can we do better?  Yes indeedy.
Since this loop never fails (unless bad things have happened) we don't
need to execute conditional code; all we need to do is to log the result
of failure.  Our first try is something like this:

	register int *ip, *iq, i, check;
	...
	check = 0;
	for (i=32;--i>=0;) {
		check += (*ip++ != *iq++);
		... seven more replications ...
		}
	if (check > 0) error_handler();

This seems clever, but it doesn't buy us anything.  Why not?  Upon 
reflection it turns out that the expression, (*ip++ != *iq++), is being
forced to 1 or 0, so we haven't really gotten rid of the conditional code.
Aha.  We don't need numbers, we need 0 and non 0.  We can get 0 and non 0
out of the two things being compared with an xor (produces non 0 on
unequal) and accumulate into check with an bit-wise or (any non 0 compare
forces check to be non 0 thereafter.)  So the replicated loop line becomes

		check |= (*ip++ ^ *iq++);

And our final timings are:

Final                       Original

self/time  .20 ms/call      1.30 ms/call
total     1.46 ms/call      2.58 ms/call

What have we learned?  First of all we learned that careful optimization
of plausible looking code can reduce execution time of the code in question
by a big factor, e.g. a factor of 6.5 in this case.  Secondly the effect on
the total time is not nearly as signifigant (cost cut by 40%.)  Thirdly
we have guidelines on code optimization:

(a)	Make arrays static or malloced; don't make them automatic.
(b)	Use loop unrolling and count down loops.
(c)	Avoid using conditional code within tight loops
(d)	It's often better to run a loop to completion and test for
	failure afterwards than it is to test with the loop.
(e)	Bitwise operators can make a signifigant difference

Just as important there are limits to the value of local code optimization.
We could gain a factor of two just by removing the readback test (safety
issue.)  More importantly the most signifigant improvements can only be
made by changing the underlying program so that fewer writes have to be
made.

Note: The test machine was a SUN 4/110.  The code presented has comments
removed and does not use the actual layout and naming conventions style.
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.