Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!cs.utexas.edu!csd4.milw.wisc.edu!uxc!tank!mimsy!chris From: chris@mimsy.UUCP (Chris Torek) Newsgroups: comp.unix.wizards Subject: VAX loops (was Re: Optimal for loop on the 68020.) Keywords: for ( i = COUNT; --i >= 0; ) Message-ID: <17941@mimsy.UUCP> Date: 7 Jun 89 08:27:10 GMT References: <11993@well.UUCP> <13592@haddock.ima.isc.com> Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742 Lines: 107 In article <13592@haddock.ima.isc.com> suitti@haddock.ima.isc.com (Stephen Uitti) writes: >The VAX (PCC based) compiler's optimizer would convert many loops >that used multiple instructions to use a single complicated >instruction. It still does. >Unfortunately, the VAX 780 (ubiquitous at the time) generally ran >these slower. I am not sure about `generally', but some do run slower (alas!). >One case was the acbl (add, compare and branch longword). The >optimizer would replace the three instructions (something like:) > add $1,r2 > cmp $END,r2 > bne L1 > >with the "acbl" and all its arguments. Both codings take the >same number of bytes (25!). The original code is


L1:	cmpl	END,reg
	jgtr	L2
	
	addl2	$CONST,reg	# CONST must be positive
	jbr	L1
L2:	

and the replacement is

	
	decl	reg
	jbr	L2
L1:	
L2:	acbl	END,$CONST,reg,L1
	

or, if the loop body is short enough (<= 130 bytes, although c2 uses
the improper metric of `eight instructions') and CONST is 1, the
acbl is replaced with an aobleq.

In neither case is the instruction byte count 25; if we count only
instructions inside the loop, the acbl is always shorter, and in any
case it is never longer.  If we assume all branches are in bounds
(which gives the best situation for the cmp/jgtr code), we get:

L1:	cmp	END,reg		# 2+k bytes, k=1 for END constant or reg
	bgtr	L2		# 2 bytes
	
	addl2	$CONST,reg	# 2+l bytes, l=1 for CONST in 0..63
	brb	L1		# 2 bytes
L2:			# total = 8+k+l (typically 10)

versus:

	decl	reg		# 2 bytes, but outside loop
	brb	L2		# 2 bytes, but outside loop
L1:	
L2:	acbl	END,$CONST,reg,L1 # 1+k+l+1+2 = 4+k+l
			# total = 4+k+l in loop, 8+k+l overall

The most common constant, however, is 1; if the branches are in
bounds, we get an aobleq and the count drops to

L2:	aobleq	reg,END,L1	# 1+1+k+1 = 3+k

for 3+k bytes in the loop and 7+k bytes total; in this case l=1 and
the cmp/jgtr loop takes 9+k bytes total, all in the loop.

If the branches are not in range, the picture becomes much murkier,
as the `jxxx' opcodes usually `explode', but sometimes `tunnel':

		jgtr	L3
		
		
	L3:

can become

		jleq	Laround		# invert the branch
		brw	L3		# and use a longer form
	Laround:
		
	L3:

but if there happens to be a branch to L3 (either unconditional or
bgtr) that is in range of the first branch, we might get instead

		bgtr	Ltunnel		# get to someone who gets us there
		
	Ltunnel:brw	L3
		
	L3:

>The multiple instructions are faster (on the 780).

This I do not intend to test (780s being rather uninteresting).

Anyway, the real point of all this is that instruction selection,
especially for baroque CISCs like a VAX, can be difficult....
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris