Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!ucsd!dog.ee.lbl.gov!elf.ee.lbl.gov!torek From: torek@elf.ee.lbl.gov (Chris Torek) Newsgroups: comp.lang.c Subject: micro-optimizing loops (was Help with casts) Summary: dispellation of nonsense Keywords: pointless, inane :-) Message-ID: <10191@dog.ee.lbl.gov> Date: 23 Feb 91 21:37:25 GMT References: <1991Feb21.040145.8678@cec1.wustl.edu> <409@ceco.ceco.com> <339@smds.UUCP> <414@ceco.ceco.com> Reply-To: torek@elf.ee.lbl.gov (Chris Torek) Organization: Lawrence Berkeley Laboratory, Berkeley Lines: 313 X-Local-Date: Sat, 23 Feb 91 13:37:25 PST [the original loop: for (i = 0; i < 100; i++) x += i; ] In article <409@ceco.ceco.com> garry@ceco.ceco.com (Garry Garrett) suggests rewriting the loop as >>> for (i=99; i; i--) >>> x += i; >In article <339@smds.UUCP> rh@smds.UUCP (Richard Harter) suggests instead: >> for (i=100;--i>=0;) >> x += i; In article <414@ceco.ceco.com> garry@ceco.ceco.com (Garry Garrett) writes: > I must disagree with you. For starters, consider the situation where >i == 0. Your loop adds 0 to x. (True; we could use for (i = 100; --i > 0;) x += i; instead, or equivalently `--i != 0', etc.) >Secondly, as my discussion that you deleted pointed out, performing >"--i>=0" would take more time than "i" `No it won't! Yes it will! No it won't! Yes it will!' :-) >The reason is that both i and 0 must be loaded into registers, compared, and a branch instruction executed for your version. My version >loads i into a register and performs a branch. This turns out not to be the case. There are a number of common `dumb' implementations for the two loops for (i = 100; --i > 0;) /* loop 1 */ x += i; and for (i = 99; i; i--) /* loop 2 */ x += i; and they all depend on the compiler's and machine's characteristics. (Note: for the discussion below I have assumed that `i' goes in a machine register. If not, the picture becomes even more complicated, as the loops then depend on which ALU operations, if any, are permitted on memory operands.) Most currently-popular machines can be placed in exactly one of these two categories: `has condition codes' or `does not have condition codes'. The former can be further divided into `has condition codes that are or can be set on arithemtic operations such as decrement' and `has condition codes that can only be set by test instructions'. The latter can be divided into `has branch on positive instruction' and `requires comparison against register'. (I am deliberately leaving modern RISCs out of the first part of this article.) On the condition-code machines, the first loop tends to compile to one of: # cc machine, compiler leaves test at top mov $100, r0 # i = 100 loop: dec r0 # decrement and set condition codes jle end # exit loop if i <= 0 # x += i, or whatever jmp loop # continue loop end: [3 instructions overhead] or: mov $100, r0 loop: dec r0 # decrement but do not set cc's tst r0 # set cc's jle end jmp loop end: [4 instructions overhead] or: # cc machine, compiler copies test to bottom mov $100, r0 dec r0 # optional: tst r0 jle end loop: dec r0 # optional: tst r0 jgt loop end: [2 or 3 instructions overhead] Thus, the loop itself contains either three or four `overhead' instructions if the compiler does not copy the test to the bottom, but only two or three if it does. If the compiler is smarter, it also notices that i=100,--i gives i=99, which is > 0, which means the top `jle' is unnecessary and the `mov $100, r0; dec r0' sequence can be simplified to `mov $99, r0'. # non-cc, branch on 0-compare; test left at top mov $100, r0 loop: dec r0 jle r0, end # jump if r0 <= 0 jmp loop end: [3 instructions overhead] or: # non-cc, branch on reg-compare; test left at top mov $100, r0 mov $0, r1 loop: dec r0 jle r0, r1, end # jump if r0 <= r1 jmp loop end: [3 instructions overhead] or (if the compiler is seriously stupid): mov $100, r0 loop: dec r0 mov $0, r1 jle r0, r1, end jmp loop end: [4 instructions overhead] or: # non-cc, branch on 0 compare; test copied to bottom: mov $100, r0 dec r0 jle r0, end loop: dec r0 jgt r0, loop [2 instructions overhead] end: or the same with the `mov $0, r1' added and the branches changed to `jxx r0, r1, end|loop' [3 instructions overhead]. Thus, the compiler tends to emit 2, 3, or 4 instructions of loop overhead, but you only get 4 if the compiler does not copy the test to the bottom of the loop. For the second loop [for (i = 99; i; i--)], we get one of these: # cc machine, does not matter whether arithmetic sets cc's; # compiler leaves test at top mov $99, r0 loop: tst r0 # set condition codes jeq end # exit loop if i==0 dec r0 jmp loop end: [4 instructions of overhead] or: # cc machine; compiler copies test to bottom # (compiler moves test to bottom) mov $99, r0 tst r0 jeq end loop: dec r0 # optional tst r0 jne loop # repeat if not 0 end: [2 or 3 instructions of overhead] On the non-cc machine, we get one of: # non-cc, test left at top mov $99, r0 # optional mov $0, r1 loop: jeq r0, end # or jeq r0, r1, end dec r0 jmp loop end: [3 instructions overhead] or: # non-cc, test left at top, really dumb compiler mov $99, r0 loop: mov $0, r1 jeq r0, r1, end dec r0 jmp loop end: [4 instructions overhead] or: # non-cc, test copied to bottom mov $99, r0 # optional mov $0, r1 jeq r0, end # or jeq r0, r1, end loop: dec r0 jne r0, loop # or jne r0, r1, end end: [2 instructions overhead] or: mov $99, r0 mov $0, r1 jeq r0, r1, end loop: dec r0 mov $0, r1 jne r0, r1, end [3 instructions overhead] Again, we get 2, 3, or 4 instructions of overhead, with 2/3 occuring when the compiler moves the test to the bottom and 4 occurring when it leaves the test at the top. >2 steps vs. your 4 steps. Nope. :-) Now consider modern RISCs, in which pipeline delays are exposed, i.e., you get one `free' instruction after a branch. Then: for (i = 99; i; i--) code; can be compiled as: mov $99, r1 # r0 always = 0 # compiler does not bother to test 99 != 0 loop: jne r1, r0, loop # branch if not 0 dec r1 # do this whether we branch or not # out of the loop now, but r1 = -1; # if it is used again, we must first set it back to 0. On the other hand, for (i = 100; --i > 0;) code; can be compiled as: mov $99, r1 mov $1, r2 loop: jgt r1, r2, loop # branch if i > 1 dec r1 # and always decrement i Both of these loops have exactly two instructions of overhead. (Note that the replacement of `--i > 0' with `i-- > 1' is legal since the only case where i-- < 1 but --i > 0 is when i==MININT before the decrement; i becomes MAXINT after the decrement, and this is an underflow, and the ANSI standard says all bets are off on underflow. Additionally, in this case, i is a loop induction variable and is known to occupy the range [99..1]. RISC compilers tend to do this sort of optimization.) A good RISC compiler may even choose to transform one of these loops to the other. Of course, the best optimization for: for (i = 1; i < 100; i++) x += i; is: x += 4950; which will typically be several hundred times faster than anyone's micro-optimization. (On another topic entirely:) > Bear in mind that C is one of the hardest languages in the world to >optimize. Not really. Try optimizing ML, or Ada, or . . . . -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov