Xref: utzoo comp.unix.wizards:16820 comp.lang.c:19263 Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!apple!ames!purdue!tut.cis.ohio-state.edu!cs.utexas.edu!uunet!mcvax!tuvie!inst182 From: inst182@tuvie (Inst.f.Techn.Informatik) Newsgroups: comp.unix.wizards,comp.lang.c Subject: Re: Optimal for loop on the 68020. Summary: Similar results on 8086, mystery Keywords: for ( i = COUNT; --i >= 0; ) Message-ID: <705@tuvie> Date: 9 Jun 89 11:58:00 GMT References: <11993@well.UUCP> Reply-To: inst182@tuvie.UUCP (Inst.f.Techn.Informatik) Organization: Technical University of Vienna, EDP-Center Lines: 351 I tried the same kinds of for loops on a XT-compatible (8086, 8MHz) using Turbo-C 2.0. (tcc -G -O -Z -S) I also tested for (i = COUNT; i --;); because I thought, it would be the fastest. For integers, the fastest loop was for (i = COUNT; -- i >= 0;); which turned out to be 1.7 times as fast as the slowest. It is remarkable, that this is the same factor as Jef Poskanzer has timed on his Sun. #define COUNT 30000 ---------------------------------------------------------------- for (i = 0; i < COUNT; i ++); 127 ms xor si,si jmp short @5 @4: inc si @5: cmp si,30000 jl @4 Comparation against a non-zero value ---------------------------------------------------------------- for (i = 0; i < COUNT; ++ i); 115 ms xor si,si jmp short @9 @8: inc si @9: cmp si,30000 jl @8 Can anybody explain, why this is faster than the above ???? ---------------------------------------------------------------- for (i = 0; ++ i <= COUNT;); 132 ms xor si,si @13: inc si mov ax,si cmp ax,30000 jle @13 Moving si to ax seems rather useless to me. ---------------------------------------------------------------- for (i = 0; i++ < COUNT;); 132 ms xor si,si @17: mov ax,si inc si cmp ax,30000 jl @17 Ok, here it is neccessary ---------------------------------------------------------------- for (i = COUNT; i > 0; i --); 94 ms mov si,30000 jmp short @21 @20: dec si @21: or si,si jg @20 The separation of decrement and comparation causes an additional or ---------------------------------------------------------------- for (i = COUNT; i > 0; --i); 93 ms mov si,30000 jmp short @25 @24: dec si @25: or si,si jg @24 Looks exactly like the version above. The 1 ms difference could be a timing error. ---------------------------------------------------------------- for (i = COUNT; --i >= 0;); 77 ms mov si,30000 @29: dec si jge @29 This surely is the optimal loop ---------------------------------------------------------------- for (i = COUNT; i-- > 0;); 115 ms mov si,30000 @33: mov ax,si dec si or ax,ax jg @33 The former value of the counter has to be saved for the comparison. ---------------------------------------------------------------- for (i = COUNT; i--;); 115 ms mov si,30000 @37: mov ax,si dec si or ax,ax jne @37 ================================================================ Then I tried the same with a long counter (instead of int), which changed the results radically. Not only is the difference between the fastest and the slowest loop much lesser (The factor is now only 1.13), but version 7, which has been the fastest for integers is now the second slowest, and version 9 now is the fastest. Versions 3 and 4 have remained at the slow end of the charts. #define COUNT 60000 ---------------------------------------------------------------- for (i = 0; i < COUNT; i ++); 938 ms mov word ptr [bp-2],0 mov word ptr [bp-4],0 jmp short @5 @4: add word ptr [bp-4],1 adc word ptr [bp-2],0 @5: cmp word ptr [bp-2],0 jl @4 jne @38 cmp word ptr [bp-4],-5536 jb @4 @38: ---------------------------------------------------------------- for (i = 0; i < COUNT; ++ i); 938 ms mov word ptr [bp-2],0 mov word ptr [bp-4],0 jmp short @9 @8: add word ptr [bp-4],1 adc word ptr [bp-2],0 @9: cmp word ptr [bp-2],0 jl @8 jne @39 cmp word ptr [bp-4],-5536 jb @8 @39: ---------------------------------------------------------------- for (i = 0; ++ i <= COUNT;); 1056 ms mov word ptr [bp-2],0 mov word ptr [bp-4],0 @13: add word ptr [bp-4],1 adc word ptr [bp-2],0 mov dx,word ptr [bp-2] mov ax,word ptr [bp-4] or dx,dx jl @13 jg @40 cmp ax,-5536 jbe @13 @40: Now the compiler seems to have forgotten that a memory location can be compared against a constant. ---------------------------------------------------------------- for (i = 0; i++ < COUNT;); 1009 ms mov word ptr [bp-2],0 mov word ptr [bp-4],0 @17: mov dx,word ptr [bp-2] mov ax,word ptr [bp-4] add word ptr [bp-4],1 adc word ptr [bp-2],0 or dx,dx jl @17 jne @41 cmp ax,-5536 jb @17 @41: ---------------------------------------------------------------- for (i = COUNT; i > 0; i --); 938 ms mov word ptr [bp-2],0 mov word ptr [bp-4],-5536 jmp short @21 @20: sub word ptr [bp-4],1 sbb word ptr [bp-2],0 @21: cmp word ptr [bp-2],0 jg @20 jne @42 cmp word ptr [bp-4],0 ja @20 @42: Now it seems to have regained this knowledge and is fast as before. Comparing against zero is not faster here than comparing against another constant. ---------------------------------------------------------------- for (i = COUNT; i > 0; -- i); 938 ms mov word ptr [bp-2],0 mov word ptr [bp-4],-5536 jmp short @25 @24: sub word ptr [bp-4],1 sbb word ptr [bp-2],0 @25: cmp word ptr [bp-2],0 jg @24 jne @43 cmp word ptr [bp-4],0 ja @24 @43: ---------------------------------------------------------------- for (i = COUNT; --i >= 0;); 1051 ms mov word ptr [bp-2],0 mov word ptr [bp-4],-5536 @29: sub word ptr [bp-4],1 sbb word ptr [bp-2],0 mov dx,word ptr [bp-2] mov ax,word ptr [bp-4] or dx,dx jg @29 jl @44 or ax,ax jae @29 @44: Again, the compiler fetches the counter into registers before comparing. Here comparing against zero saves a little time. But letting the counter in memory would have saved more. ---------------------------------------------------------------- for (i = COUNT; i-- > 0;); 1004 ms mov word ptr [bp-2],0 mov word ptr [bp-4],-5536 @33: mov dx,word ptr [bp-2] mov ax,word ptr [bp-4] sub word ptr [bp-4],1 sbb word ptr [bp-2],0 or dx,dx jg @33 jne @45 or ax,ax ja @33 @45: ---------------------------------------------------------------- for (i = COUNT; i--;); 934 ms mov word ptr [bp-2],0 mov word ptr [bp-4],-5536 @37: mov dx,word ptr [bp-2] mov ax,word ptr [bp-4] sub word ptr [bp-4],1 sbb word ptr [bp-2],0 or dx,ax jne @37 Checking a long for *equality* to zero saves lots of time, so the overhead for loading the registers is more than compensated. ---------------------------------------------------------------- ??? 374 ms mov si,0 mov di,30000 @_l: mov dx,si mov ax,di sub di,1 sbb si,0 or dx,ax jne @_l This would have been the optimal loop, but Turbo-C does not put long variables into registers. ================================================================ Please email to address below, because the account I was posting this from is not mine. _______________________________________________________________ | __ | | | | | \ | Peter J. Holzer | | |___|__/ | Technische Universitaet Wien | | | | | | | | | | ...!uunet!mcvax!tuvie!asupa!honey!hjphr | | ____/ |--------------------------------------------------| | | Think of it as evolution in action -- Tony Rand | |____________|__________________________________________________|