Xref: utzoo comp.unix.wizards:16653 comp.lang.c:19161 Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!well!pokey From: pokey@well.UUCP (Jef Poskanzer) Newsgroups: comp.unix.wizards,comp.lang.c Subject: Optimal for loop on the 68020. Keywords: for ( i = COUNT; --i >= 0; ) Message-ID: <11993@well.UUCP> Date: 5 Jun 89 00:18:11 GMT Reply-To: Jef Poskanzer Organization: Paratheo-Anametamystikhood Of Eris Esoteric, Ada Lovelace Cabal Lines: 225 SUMMARY ------- I compiled the following different kinds of for loops: for ( i = 0; i < COUNT; i++ ) for ( i = 0; i < COUNT; ++i ) for ( i = 0; ++i <= COUNT; ) for ( i = 0; i++ < COUNT; ) for ( i = COUNT; i > 0; i-- ) for ( i = COUNT; i > 0; --i ) for ( i = COUNT; --i >= 0; ) for ( i = COUNT; i-- > 0; ) on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the generated code and the timings. COUNT was a small (< 127) compile-time constant. The loop body did not reference i and had no side-effects. In theory, all eight of these loops should have optimized to the most efficient loop possible, ignoring the otherwise unreferenced variable i and simply traversing the loop body the proper number of times. On the 68020, this most efficient loop is a dbra instruction. But in fact, cc never generates dbra, and gcc generates it for only one of the above loop types, and only when the -fstrength-reduce flag is specified. In contrast, the BSD cc on a vax generates a single instruction loop in four out of the eight cases, including the two most commonly-used ones. This is not because the vax compiler is any smarter -- the fact that it wins only half the time, instead of all the time, shows that it is also failing to recognize that it can freely optimize the loop. The only reason the vax's cc does better is that the vax has a richer instruction set. Since the loop overhead for dbra is 1.7 times less than for the code from the common for loops, it is useful to know how to provoke your compiler into generating it. But it would be more useful for compilers to be smarter. And more generally, those of you who blithely assume (as I used to) that compilers are smart enough to turn clearly-expressed C into optimal assembler, should perhaps look at their assembler output more often. FULL DATA --------- Case 1) First, let's look at the generic loops, counting up from zero to the limit: for ( i = 0; i < COUNT; i++ ) for ( i = 0; i < COUNT; ++i ) cc -O gcc -O & gcc -O -fstrength-reduce moveq #0,d6 clrl d0 tag: tag: addql #1,d6 addql #1,d0 moveq #COUNT,d1 moveq #COUNT-1,d3 cmpl d1,d6 cmpl d0,d3 jlt tag jge tag 0.97 usecs 0.97 usecs The two compilers generate essentially the same instructions. Pre or post increment doesn't make any difference. Note that the moveq of COUNT could be moved above the loop -- the loop body doesn't doesn't call any subroutines that might smash it. This looks like a missed optimization. Case 2a) Second, let's try doing the increment and test in one C statement. Seems to me this shouldn't make any difference to a smart optimizer, but... for ( i = 0; ++i <= COUNT; ) cc -O gcc -O & gcc -O -fstrength-reduce moveq #0,d6 clrl d0 jra tag2 jra tag2 tag1: tag1: tag2: tag2: addql #1,d6 addql #1,d0 moveq #COUNT,d1 moveq #COUNT,d3 cmpl d1,d6 cmpl d0,d3 jle tag1 jge tag1 0.97 usecs 0.97 usecs No important difference from case 1. Case 2b) But if we try the post increment version: for ( i = 0; i++ < COUNT; ) cc -O gcc -O & gcc -O -fstrength-reduce moveq #0,d6 clrl d0 jra tag2 jra tag2 tag1: tag1: tag2: tag2: movl d6 d0 addql #1,d0 addql #1,d6 moveq #COUNT+1,d3 moveq #COUNT,d1 cmpl d0,d3 cmpl d1,d0 jgt tag1 jlt tag1 1.11 usecs 0.97 usecs Gcc is smart enough to bias the loop count, but cc doesn't see that optimization and so adds an extra instruction. Case 3) Third, let's try a count-down loop: for ( i = COUNT; i > 0; i-- ) for ( i = COUNT; i > 0; --i ) cc -O gcc -O & gcc -O -fstrength-reduce moveq #COUNT,d6 moveq #COUNT,d0 tag: tag: subql #1,d6 subql #1,d0 tstl d6 tstl d0 jgt tag jgt tag 0.83 usecs 0.83 usecs Here the two compilers generate exactly the same instructions. There is one less instruction than in the count-up case, because the compilers are smart enough to use the tstl instruction to compare against zero, and so they don't need the moveq #COUNT instruction (which shouldn't have been in the loop in the first place). Case 4a) Fourth, let's try a count-down loop with the decrement included in the test statement: for ( i = COUNT; --i >= 0; ) cc -O gcc -O gcc -O -fstrength-reduce moveq #COUNT,d6 moveq #COUNT,d0 moveq #COUNT,d0 jra tag2 jra tag2 jra tag2 tag1: tag1: tag1: tag2: tag2: tag2: subql #1,d6 subql #1,d0 dbra d0,tag1 jge tag1 jpl tag1 clrw d0 subql #1,d0 jcc tag1 0.70 usecs 0.70 usecs 0.57 usecs Cc and gcc generate code similar to case 3, except that they suddenly decide that subql sets the condition codes just fine by itself, and a separate tstl is not necessary. So they shave off an instruction. Seems like a peephole optimizer should have picked this up in the previous cases too; it's another missed optimization. But the big win here is with the -fstrength-reduce option in gcc. It actually generates the optimal one-instruction loop, for a factor of 1.7 reduction in loop overhead from the generic loop. Not bad. But wait! What's that chud after the loop? Let's see, clear d1 to zero, subtract one from it giving -1 and setting carry, and jump if carry is clear. Hmm, looks like a three-instruction no-op to me! I guess gcc wants to leave the loop index with the right value, and isn't smart enough to notice it isn't used. (But why not simply moveq the final value?) Oh well, since it's not inside the loop it doesn't make much difference. Case 4b) But if we try this with post decrement: for ( i = COUNT; i-- > 0; ) cc -O gcc -O & gcc -O -fstrength-reduce moveq #COUNT,d6 moveq #COUNT,d0 jra tag2 jra tag2 tag1: tag1: tag2: tag2: movl d6,d0 subql #1,d0 subql #1,d6 moveq #-1,d3 tstl d0 cmpl d0,d3 jgt tag1 jlt tag1 0.97 usecs 0.97 usecs Cc not only adds in the movl to save the unincremented value of i, it also somehow fails to realize that subql sets the condition codes, so the tstl is back. And gcc gets totally confused and brings in a literal -1. CONCLUSION ---------- For both compilers and all levels of optimization, this loop: for ( i = COUNT; --i >= 0; ) gives the lowest overhead. Using gcc -fstrength-reduce, this low overhead means the optimal single-instruction loop; but even for cc and for gcc without -fstrength-reduce, this loop is faster than the others. I don't show the results here, but this is also true for large constant loops and for variable-length loops. It is unfortunate that we have to use such a non-intuitive loop to get the best results -- especially unfortunate since it is probably *not* the best loop on some other architectures or compilers -- but that's the facts. I would be interested to see similar checks done on different architectures; in particular RISC machines, which supposedly are designed in cooperation with the compiler writers. --- Jef Jef Poskanzer {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey "Man invented language to satisfy his deep need to complain." -- Lily Tomlin