Xref: utzoo comp.unix.wizards:16677 comp.lang.c:19181 Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!rutgers!bellcore!clyde!ima!haddock!suitti From: suitti@haddock.ima.isc.com (Stephen Uitti) Newsgroups: comp.unix.wizards,comp.lang.c Subject: Re: Optimal for loop on the 68020. Keywords: for ( i = COUNT; --i >= 0; ) Message-ID: <13592@haddock.ima.isc.com> Date: 5 Jun 89 20:15:12 GMT References: <11993@well.UUCP> Reply-To: suitti@haddock.ima.isc.com (Stephen Uitti) Organization: Interactive Systems, Boston Lines: 72 In article <11993@well.UUCP> Jef Poskanzer writes: >I compiled the following different kinds of for loops: > for ( i = 0; i < COUNT; i++ ) > for ( i = 0; i < COUNT; ++i ) > for ( i = 0; ++i <= COUNT; ) > for ( i = 0; i++ < COUNT; ) > for ( i = COUNT; i > 0; i-- ) > for ( i = COUNT; i > 0; --i ) > for ( i = COUNT; --i >= 0; ) > for ( i = COUNT; i-- > 0; ) >on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the >generated code and the timings. COUNT was a small (< 127) compile-time >constant. A huge amount of time ago, i did this for the PDP-11 using the Ritchie compiler. This compiler is not awesomely bright, but i had the impression that it wasn't quite as bad as PCC (or even less portable). The optimal loop would use the "sob" (subtract one and branch if not zero) instruction. The code that caused it to be emitted was register i; i = COUNT; do { loop body; } while (--i); This is much clearer than anything with for(), since it tends to get (even stupidly) compiled to: register i; i = COUNT; mov $COUNT,r2 do { L1: loop body; loop body... } while (--i); sob r2,L1 The compiler seemed to be smart enough to know not to use it if the loop was too long. Thus, at worst, the sob would be replaced by: sub $1,r2 jne L1 The trouble with "for" loops is that it is defined as "test at the top", when most single instructions loops are "test at bottom". The VAX (PCC based) compiler's optimizer would convert many loops that used multiple instructions to use a single complicated instruction. Unfortunately, the VAX 780 (ubiquitous at the time) generally ran these slower. One case was the acbl (add, compare and branch longword). The optimizer would replace the three instructions (something like:) add $1,r2 cmp $END,r2 bne L1 with the "acbl" and all its arguments. Both codings take the same number of bytes (25!). The multiple instructions are faster (on the 780). Newer VAXen have different relative timing. I recall this case coming from the prime number sieve benchmark. The loop could not be easily converted to the do-while. I haven't gotten a chance to play with gcc (real work to do). I've heard it can do wonders, and thought it did better multi-statement optimization. Still, all the silly side-effects semantics of C make it hard to tell a compiler how to take mediocre source and produce good code from it. It is best to start with good source. Though i hardly ever write assembler anymore (and it seems never for speed or size), i've still yet to bump into a C compiler that produces code that i can't improve on at a glance. This may change if/when i start working with scoreboarded RISC processors (88K), in that instruction cycle counting looks hard (requiring more than a glance). Stephen.