Xref: utzoo comp.unix.wizards:16820 comp.lang.c:19263
Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!apple!ames!purdue!tut.cis.ohio-state.edu!cs.utexas.edu!uunet!mcvax!tuvie!inst182
From: inst182@tuvie (Inst.f.Techn.Informatik)
Newsgroups: comp.unix.wizards,comp.lang.c
Subject: Re: Optimal for loop on the 68020.
Summary: Similar results on 8086, mystery
Keywords: for ( i = COUNT; --i >= 0; )
Message-ID: <705@tuvie>
Date: 9 Jun 89 11:58:00 GMT
References: <11993@well.UUCP>
Reply-To: inst182@tuvie.UUCP (Inst.f.Techn.Informatik)
Organization: Technical University of Vienna, EDP-Center
Lines: 351

I tried the same kinds of for loops on a XT-compatible (8086, 8MHz)
using Turbo-C 2.0. (tcc -G -O -Z -S)

I also tested

for (i = COUNT; i --;);

because I thought, it would be the fastest.

For integers, the fastest loop was

for (i = COUNT; -- i >= 0;);

which turned out to be 1.7 times as fast as the slowest. It is
remarkable, that this is the same factor as Jef Poskanzer has timed on
his Sun.

#define COUNT 30000

----------------------------------------------------------------
for (i = 0; i < COUNT; i ++);

127 ms

	xor	si,si
	jmp	short @5
@4:
	inc	si
@5:
	cmp	si,30000
	jl	@4

Comparation against a non-zero value
----------------------------------------------------------------
for (i = 0; i < COUNT; ++ i);

115 ms

	xor	si,si
	jmp	short @9
@8:
	inc	si
@9:
	cmp	si,30000
	jl	@8

Can anybody explain, why this is faster than the above ????
----------------------------------------------------------------
for (i = 0; ++ i <= COUNT;);

132 ms

	xor	si,si
@13:
	inc	si
	mov	ax,si
	cmp	ax,30000
	jle	@13

Moving si to ax seems rather useless to me.
----------------------------------------------------------------
for (i = 0; i++ < COUNT;);

132 ms

	xor	si,si
@17:
	mov	ax,si
	inc	si
	cmp	ax,30000
	jl	@17

Ok, here it is neccessary
----------------------------------------------------------------
for (i = COUNT; i > 0; i --);

94 ms

	mov	si,30000
	jmp	short @21
@20:
	dec	si
@21:
	or	si,si
	jg	@20

The separation of decrement and comparation causes an additional or
----------------------------------------------------------------
for (i = COUNT; i > 0; --i);

93 ms

	mov	si,30000
	jmp	short @25
@24:
	dec	si
@25:
	or	si,si
	jg	@24

Looks exactly like the version above. The 1 ms difference could be a
timing error.
----------------------------------------------------------------
for (i = COUNT; --i >= 0;);

77 ms

	mov	si,30000
@29:
	dec	si
	jge	@29

This surely is the optimal loop
----------------------------------------------------------------
for (i = COUNT; i-- > 0;);

115 ms

	mov	si,30000
@33:
	mov	ax,si
	dec	si
	or	ax,ax
	jg	@33

The former value of the counter has to be saved for the comparison.
----------------------------------------------------------------
for (i = COUNT; i--;);

115 ms

	mov	si,30000
@37:
	mov	ax,si
	dec	si
	or	ax,ax
	jne	@37
================================================================

Then I tried the same with a long counter (instead of int), which
changed the results radically. Not only is the difference between the
fastest and the slowest loop much lesser (The factor is now only
1.13), but version 7, which has been the fastest for integers is now
the second slowest, and version 9 now is the fastest.

Versions 3 and 4 have remained at the slow end of the charts.

#define COUNT	60000
----------------------------------------------------------------
for (i = 0; i < COUNT; i ++);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
	jmp	short @5
@4:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
@5:
	cmp	word ptr [bp-2],0
	jl	@4
	jne	@38
	cmp	word ptr [bp-4],-5536
	jb	@4
@38:

----------------------------------------------------------------
for (i = 0; i < COUNT; ++ i);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
	jmp	short @9
@8:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
@9:
	cmp	word ptr [bp-2],0
	jl	@8
	jne	@39
	cmp	word ptr [bp-4],-5536
	jb	@8
@39:
----------------------------------------------------------------
	for (i = 0; ++ i <= COUNT;);

1056 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
@13:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	or	dx,dx
	jl	@13
	jg	@40
	cmp	ax,-5536
	jbe	@13
@40:

Now the compiler seems to have forgotten that a memory location can be
compared against a constant.
----------------------------------------------------------------
	for (i = 0; i++ < COUNT;);

1009 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
@17:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
	or	dx,dx
	jl	@17
	jne	@41
	cmp	ax,-5536
	jb	@17
@41:
----------------------------------------------------------------
	for (i = COUNT; i > 0; i --);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
	jmp	short @21
@20:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
@21:
	cmp	word ptr [bp-2],0
	jg	@20
	jne	@42
	cmp	word ptr [bp-4],0
	ja	@20
@42:

Now it seems to have regained this knowledge and is fast as before.
Comparing against zero is not faster here than comparing against
another constant.
----------------------------------------------------------------
	for (i = COUNT; i > 0; -- i);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
	jmp	short @25
@24:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
@25:
	cmp	word ptr [bp-2],0
	jg	@24
	jne	@43
	cmp	word ptr [bp-4],0
	ja	@24
@43:
----------------------------------------------------------------
	for (i = COUNT; --i >= 0;);

1051 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@29:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	or	dx,dx
	jg	@29
	jl	@44
	or	ax,ax
	jae	@29
@44:

Again, the compiler fetches the counter into registers before
comparing. Here comparing against zero saves a little time.
But letting the counter in memory would have saved more.
----------------------------------------------------------------
	for (i = COUNT; i-- > 0;);

1004 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@33:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	or	dx,dx
	jg	@33
	jne	@45
	or	ax,ax
	ja	@33
@45:
----------------------------------------------------------------
	for (i = COUNT; i--;);

934 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@37:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	or	dx,ax
	jne	@37

Checking a long for *equality* to zero saves lots of time, so the
overhead for loading the registers is more than compensated.
----------------------------------------------------------------
	???

374 ms

	mov	si,0
	mov	di,30000
@_l:
	mov	dx,si
	mov	ax,di
	sub	di,1
	sbb	si,0
	or	dx,ax
	jne	@_l

This would have been the optimal loop, but Turbo-C does not put long
variables into registers.
================================================================

Please email to address below, because the account I was posting this from is not
mine.
 _______________________________________________________________
|	__   |							|
|  |   |  \  | Peter J. Holzer					|
|  |___|__/  | Technische Universitaet Wien			|
|  |   |     |							|
|  |   |     | ...!uunet!mcvax!tuvie!asupa!honey!hjphr		|
|  ____/     |--------------------------------------------------|
|	     | Think of it as evolution in action -- Tony Rand	|
|____________|__________________________________________________|