Xref: utzoo comp.unix.wizards:16674 comp.lang.c:19177
Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!ames!haven!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.unix.wizards,comp.lang.c
Subject: Re: Optimal for loop on the 68020.
Keywords: for ( i = COUNT; --i >= 0; )
Message-ID: <17891@mimsy.UUCP>
Date: 5 Jun 89 20:11:24 GMT
References: <11993@well.UUCP>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 41

In article <11993@well.UUCP> pokey@well.UUCP (Jef Poskanzer) writes:
>... COUNT was a small (< 127) compile-time constant.
>    for ( i = COUNT; --i >= 0; )

[all but gcc -O -fstrength-reduce deleted]

>	moveq  #COUNT,d0
>	jra    tag2
>tag1:
>	<loop body>
>tag2:
>	dbra   d0,tag1
>	clrw   d0
>	subql  #1,d0
>	jcc    tag1

>... But wait!  What's that chud after the loop?  Let's see, clear d1
>to zero, subtract one from it giving -1 and setting carry, and jump
>if carry is clear.  Hmm, looks like a three-instruction no-op to me!

No---the problem is that `dbra' decrements a *word*, compares the
result against -1, and (if not -1) branches.  The semantics of the
loop demand a 32 bit comparison.  The only reason the fixup is not
necessary in this particular case is the first quoted line above:
COUNT is small enough that the counter never leaves the low word.

Still, it would be nice if gcc always used the dbra/clrw/subql/jcc
sequence for `--x >= 0' loops, since it does always work.  The `clrw'
fixes up the case where the 16-bit result has gone to -1:

	before decrement:	wxyz 0000
	after decrement:	wxyz FFFF
	after clrw:		wxyz 0000
	after subql:	      wxyz-1 FFFF
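
For anyone who wants to check the table above without a 68020 handy,
here is a rough C model of the sequence.  It is only a sketch: the
function names (trips_c, dbra, trips_dbra_only, trips_full,
same_trip_count) are made up for illustration, d0 is modelled as a
uint32_t, and I only consider non-negative starting counts.  It counts
how often the loop body would run under the plain C loop, under a bare
`dbra', and under the full dbra/clrw/subql/jcc sequence.

```c
#include <stdint.h>

/* Trip count of the C loop itself: for (i = count; --i >= 0; ) body; */
uint32_t trips_c(int32_t count)
{
    uint32_t n = 0;
    int32_t i;

    for (i = count; --i >= 0; )
        n++;
    return n;
}

/* Model of `dbra d0,tag1': decrement only the low word of d0 and
   report whether the branch is taken (low word did not become -1). */
int dbra(uint32_t *d0)
{
    uint16_t low = (uint16_t)(*d0 & 0xFFFFu);

    low = (uint16_t)(low - 1u);
    *d0 = (*d0 & 0xFFFF0000u) | low;
    return low != 0xFFFFu;
}

/* Loop closed by a bare dbra: only the low 16 bits are counted. */
uint32_t trips_dbra_only(uint32_t count)
{
    uint32_t d0 = count, n = 0;

    while (dbra(&d0))
        n++;
    return n;
}

/* Loop closed by dbra / clrw / subql #1 / jcc. */
uint32_t trips_full(uint32_t count)
{
    uint32_t d0 = count, n = 0;

    for (;;) {
        int carry;

        if (dbra(&d0)) {        /* dbra d0,tag1 taken: body runs   */
            n++;
            continue;
        }
        d0 &= 0xFFFF0000u;      /* clrw d0                         */
        carry = (d0 == 0);      /* subql #1,d0 borrows iff d0 == 0 */
        d0 -= 1u;
        if (carry)              /* jcc tag1 not taken: loop ends   */
            break;
        n++;                    /* jcc tag1 taken: body runs       */
    }
    return n;
}

/* 1 if the full sequence runs the body as often as the C loop does
   (assumes 0 <= count <= INT32_MAX; keep count modest, it is O(count)). */
int same_trip_count(uint32_t count)
{
    return trips_full(count) == trips_c((int32_t)count);
}
```

With a count of 0x10005, trips_dbra_only() sees only the low word and
stops after 5 trips, while trips_full() goes through the clrw/subql
fixup once and agrees with trips_c().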

The dbra loop is so much faster, even counting the extra time and
space for one `unnecessary' dbra+clrw (when the loop really does go
from 0 to -1, and at every 65536 trips when the loop counter is large
and positive), that I would make this optimisation unconditional.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris