Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!hplabs!hp-pcd!hplsla!jima From: jima@hplsla.HP.COM (Jim Adcock) Newsgroups: comp.lang.c Subject: Re: Optimal for loop on the 68020. Message-ID: <5260016@hplsla.HP.COM> Date: 12 Jun 89 19:12:09 GMT References: <11993@well.UUCP> Organization: HP Lake Stevens, WA Lines: 149 >Random thought: What are our "optimizing compilers" optimizing if they >leave loops alone? Its not that they leave loops alone, they will munge either side of a label as well as they can. Optimizing across a label is hard, since such optimizations might affect [bust] the code branching to that label. See example to follow. Again, one cannot just try some trivial loop cases, say "form X is fastest", and use it blindly from there on. Life is not that simple. Compilers do not say: "aha, he's using loop form number 5, that calls for a dbra." Typically, a good optimizing compiler is going to start by naively generating a "tree" representing one's program segment, said "tree" unfortunately containing loops and brach statements which complicate a compiler's life tremendously. Then the tree is simplified using simple rules of arthmetic: "[add [const 4] [const 1]] --> [const 5]" for example. Consider what loops and branches can easily do to complicate a compiler's life: [reg0 = [const 4]] label401: [reg1 = [const 1]] [reg2 = [add reg0 reg1]] Can the final statement be simplified to: "[reg2 = [const 5]]" ??? Maybe, but you'd have to have knowledge of all the areas that branch to label401. Clearly, compilers don't like to do this! And situations like this arise all the time when your for loops get converted to a branch plus a label. Finally, after these arithmetic simplifications, attempts are made to map the tree onto machine instructions, or combinations of machine instructions. Wierd, but useful instructions like "dbra" are likely to be handled as a peep-hole optimization -- where a final stage of the compiler tries to substitute simpler instructions for specific combinations of one or more more expensive instructions. So in real life programming, whether your favorite "dbra" instruction gets used or not is close to blind luck! Its just going to depend on exactly what you do in that loop, what registers are needed at other points in the routine, etc, etc. In real life programming these things are just not very repeatable. Consider, using gcc -O -S on: #define OUTERLOOP 10000 #define COUNT 1000 main() { int i; int j; int ary[COUNT]; for (j=OUTERLOOP; j--;) { for ( i = COUNT; i-- > 0; ) ary[i] = i; } } gives: _main: link a6,#-4000 movel d2,sp@- movel #10000,d1 jra L2 L9: movel #1000,d0 jra L5 L8: lea a6@(d0:l:4),a0 movel d0,a0@(-4000) L5: subql #1,d0 moveq #-1,d2 cmpl d0,d2 jlt L8 L2: dbra d1,L9 clrw d1 subql #1,d1 jcc L9 movel a6@(-4004),d2 unlk a6 rts which executes in 15 seconds on my mid-range 68020 system. Using the "magic trick" of: "short s; for (s = COUNT; c--;) ..." to try to match a dbra instruction: #define OUTERLOOP 10000 #define COUNT 1000 main() { short s; int j; int ary[COUNT]; for (j=OUTERLOOP; j--;) { for ( s = COUNT; s-- > 0; ) ary[s] = s; } generates: _main: link a6,#-4000 movel #10000,d1 jra L2 L9: movew #1000,d0 jra L5 L8: movew d0,a1 lea a6@(a1:l:4),a0 movel a1,a0@(-4000) L5: subqw #1,d0 cmpw #-1,d0 jgt L8 L2: dbra d1,L9 clrw d1 subql #1,d1 jcc L9 unlk a6 rts which takes 16 seconds to execute! So if one has an optimizing compiler I think one is fooling oneself if one thinks one can memorize a favorite loop structure and use it blindly. Write code using common sense, that is easy to read, and not too tricky. If you need more speed, find a different algorithm, or profile it, and work on just those areas that are too slow, re-writing in assembly if necessary. More importantly, try to get one's hands on the best optimizing compiler available for one's machine. Any damned compiler can be advertised as an "optimizing" compiler -- the word means nothing. Any damned compiler can be tweaked to generate great looking code for a few do-nothing for loops -- it means nothing. Try to get a couple of the best compilers on you machine and then try them on a variety of code indicative of the work you do. They may vary greatly in areas far more important to you than loop construction. For example, 680x0 compilers I have seen vary *greatly* in the approaches they take to using the 6888x coprocessor -- with many having a rudimentary or non-existant concept of what it means to optimize floating pt instructions. Many only use registers 0 and 1 in the 6888x coprocessor! Many flotch floating pt numbers to and from the 6888x in order to convert from single to double floating pt and back, etc, etc...... [I have yet to see a compiler which is good at both 680x0 instruction generation and 6888x instruction generation..]