Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!umd5!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.arch
Subject: Re: The VAX Always Uses Fewer Instructions
Message-ID: <11981@mimsy.UUCP>
Date: 15 Jun 88 20:16:09 GMT
References: <6921@cit-vax.Caltech.Edu> <28200161@urbsdc> <10595@sol.ARPA>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 65

In article <10595@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>For example, the loop to add two vectors into a third on the VAX is:
> 
>   top: addl3 (rA)+, (rB)+, (rC)+
>        sobgeq rD, top
>
>which takes seven bytes for two instructions.

True.  An optimising compiler might unroll the loop eight times, however:

	extzv	$0,$3,rD,r0
	bicl2	r0,rD			# or bicl2 $7,rD; same length
	casel	r0,$0,$7		# start the right distance in
9:	.word	0f - 9b			# 0
	.word	1f - 9b			# 1
	...
	.word	7f - 9b			# 7
7:	addl3	(rA)+,(rB)+,(rC)+
6:	addl3	(rA)+,(rB)+,(rC)+
5:	addl3	(rA)+,(rB)+,(rC)+
4:	addl3	(rA)+,(rB)+,(rC)+
3:	addl3	(rA)+,(rB)+,(rC)+
2:	addl3	(rA)+,(rB)+,(rC)+
1:	addl3	(rA)+,(rB)+,(rC)+
0:	addl3	(rA)+,(rB)+,(rC)+
	acbl	$0,$-8,rD,7b		# while (rD-=8) >= 0

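This pushes the size up to (I think) 70 bytes: 12 for the three setup
instructions, 16 for the branch table, 32 for the eight addl3s, and 10
for the acbl.  Too bad the RISC machines are still faster anyway :-) .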
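
For readers who think in C, the unrolled loop above can be sketched
roughly as follows.  This is only an illustration, not compiler output:
the function name vadd_unrolled and its arguments are made up here, d
plays the role of rD (so d + 1 elements are added), and the sketch
assumes d is non-negative.

```c
#include <assert.h>

/* Hypothetical C rendering of the unrolled VAX loop; all names are
 * illustrative.  As with the sobgeq loop, d is "count minus one", so
 * d + 1 elements are added.  Returns the final value of d, the
 * analogue of rD on exit. */
long vadd_unrolled(const long *a, const long *b, long *c, long d)
{
    assert(d >= 0);          /* assumption: the sketch covers d >= 0 only */

    long r0 = d & 7;         /* extzv $0,$3,rD,r0 */
    d &= ~r0;                /* bicl2 r0,rD */

    switch (r0) {            /* casel r0,$0,$7: start the right distance in */
        do {
    case 7: *c++ = *a++ + *b++;  /* each case falls through on purpose */
    case 6: *c++ = *a++ + *b++;
    case 5: *c++ = *a++ + *b++;
    case 4: *c++ = *a++ + *b++;
    case 3: *c++ = *a++ + *b++;
    case 2: *c++ = *a++ + *b++;
    case 1: *c++ = *a++ + *b++;
    case 0: *c++ = *a++ + *b++;
            d -= 8;
        } while (d >= 0);    /* acbl $0,$-8,rD,7b */
    }
    return d;
}
```

Entering at case r0 performs r0 + 1 adds on the first pass, and every
later pass performs all eight, which is the same trick as the casel
branch table.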

Actually, you could get rid of the case and the branch table by
computing the jump offset directly:

	extzv	$0,$3,rD,r0
	bicl2	r0,rD
	subl3	r0,$7,r0	# invert
	ashl	$2,r0,r0	# times 4, size of addl3 instr below
	jmp	(pc)[r0]	# into the breach (or is it breech?...kapow!
0:	addl3	(rA)+,(rB)+,(rC)+	# maybe an ancient muzzle loader :-) )
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	acbl	$0,$-8,rD,0b

This drops off 9 bytes, down to 61 bytes.  You can get rid of 5 more
bytes by changing the acbl into

	subl2	$8,rD
	bgeq	0b

but on non-pipelined VAXen that might be slower.  Alternatively, if you
have another register free, do `mnegl $8,r1' before the loop and give the
acbl r1 instead of $-8; this saves only 1 byte overall, but brings the
acbl itself down to 6 bytes.
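
For anyone who wants to check the byte counts, here is a small tally in
C.  The per-instruction sizes are assumptions taken from the usual VAX
encoding rules (one opcode byte; one byte for each register,
autoincrement, or short-literal operand; five bytes for a longword
immediate such as $-8; a byte displacement on sobgeq and bgeq; a word
displacement on acbl), not figures measured with an assembler.  The
function names are invented for the example.

```c
#include <assert.h>

/* Assumed encoded sizes in bytes (opcode plus operand specifiers). */
enum {
    ADDL3_AUTOINC = 4,   /* addl3 (rA)+,(rB)+,(rC)+ */
    SOBGEQ        = 3,   /* sobgeq rD,top: byte displacement */
    EXTZV         = 5,   /* extzv $0,$3,rD,r0 */
    BICL2         = 3,   /* bicl2 r0,rD */
    CASEL         = 4,   /* casel r0,$0,$7 */
    CASE_TABLE    = 16,  /* eight .word entries */
    SUBL3         = 4,   /* subl3 r0,$7,r0 */
    ASHL          = 4,   /* ashl $2,r0,r0 */
    JMP_INDEXED   = 3,   /* jmp (pc)[r0] */
    ACBL_IMM      = 10,  /* acbl $0,$-8,rD,label: $-8 takes 5 bytes */
    ACBL_REG      = 6,   /* acbl $0,r1,rD,label */
    SUBL2         = 3,   /* subl2 $8,rD */
    BGEQ          = 2,   /* bgeq label */
    MNEGL         = 3    /* mnegl $8,r1 */
};

int size_simple_loop(void)
{
    return ADDL3_AUTOINC + SOBGEQ;
}

int size_casel_version(void)
{
    return EXTZV + BICL2 + CASEL + CASE_TABLE
         + 8 * ADDL3_AUTOINC + ACBL_IMM;
}

int size_jmp_version(void)
{
    return EXTZV + BICL2 + SUBL3 + ASHL + JMP_INDEXED
         + 8 * ADDL3_AUTOINC + ACBL_IMM;
}

/* acbl replaced by subl2 $8,rD / bgeq */
int size_jmp_subl2_bgeq(void)
{
    return size_jmp_version() - ACBL_IMM + SUBL2 + BGEQ;
}

/* mnegl $8,r1 added, acbl takes r1 instead of $-8 */
int size_jmp_mnegl_acbl(void)
{
    return size_jmp_version() - ACBL_IMM + MNEGL + ACBL_REG;
}

/* Checks the tallies against the figures quoted in the post. */
void check_sizes_against_post(void)
{
    assert(size_simple_loop() == 7);
    assert(size_casel_version() == 70);
    assert(size_jmp_version() == 61);
    assert(size_jmp_subl2_bgeq() == 61 - 5);
    assert(size_jmp_mnegl_acbl() == 61 - 1);
}
```

Under those assumptions the totals come out to 7, 70, 61, 56 and 60
bytes respectively.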

[N.B.: the sobgeq loop above performs rD+1 adds, so I made the acbl
loops do the same.  rD is left in a different state on exit (-8 rather
than -1), and I did need r0 as a scratch register for the entry
calculation.]
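
The iteration counts and exit values in that note can be checked with a
throwaway model in C.  This is a sketch only: it counts adds rather
than performing them, the struct and function names are invented, and
the unrolled model assumes a non-negative starting rD.

```c
#include <assert.h>

struct loop_result {
    long adds;      /* number of addl3 executions */
    long final_d;   /* value left in rD on exit */
};

/* top: addl3 ...; sobgeq rD,top
 * The body runs, rD is decremented, and the branch is taken while
 * rD >= 0. */
struct loop_result model_sobgeq(long d)
{
    struct loop_result r = { 0, d };
    do {
        r.adds++;
        r.final_d--;
    } while (r.final_d >= 0);
    return r;
}

/* Unrolled loop: the first pass enters far enough in to do
 * (rD & 7) + 1 adds, every later pass does 8, and the acbl (or the
 * subl2/bgeq pair) subtracts 8 and loops while rD >= 0. */
struct loop_result model_unrolled(long d)
{
    assert(d >= 0);                       /* assumption of this sketch */

    long adds_this_pass = (d & 7) + 1;    /* from the extzv */
    struct loop_result r = { 0, d & ~7L };/* bicl2 clears the low bits */
    do {
        r.adds += adds_this_pass;
        adds_this_pass = 8;
        r.final_d -= 8;
    } while (r.final_d >= 0);
    return r;
}
```

For any starting rD >= 0 both models report rD+1 adds; the sobgeq model
exits with rD == -1 and the unrolled one with rD == -8.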

All of this just goes to show that the VAX provides too many ways to
do things!
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris