Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!cs.utexas.edu!csd4.milw.wisc.edu!uxc!tank!mimsy!chris From: chris@mimsy.UUCP (Chris Torek) Newsgroups: comp.unix.wizards Subject: VAX loops (was Re: Optimal for loop on the 68020.) Keywords: for ( i = COUNT; --i >= 0; ) Message-ID: <17941@mimsy.UUCP> Date: 7 Jun 89 08:27:10 GMT References: <11993@well.UUCP> <13592@haddock.ima.isc.com> Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742 Lines: 107 In article <13592@haddock.ima.isc.com> suitti@haddock.ima.isc.com (Stephen Uitti) writes: >The VAX (PCC based) compiler's optimizer would convert many loops >that used multiple instructions to use a single complicated >instruction. It still does. >Unfortunately, the VAX 780 (ubiquitous at the time) generally ran >these slower. I am not sure about `generally', but some do run slower (alas!). >One case was the acbl (add, compare and branch longword). The >optimizer would replace the three instructions (something like:) > add $1,r2 > cmp $END,r2 > bne L1 > >with the "acbl" and all its arguments. Both codings take the >same number of bytes (25!). The original code is L1: cmpl END,reg jgtr L2 addl2 $CONST,reg # CONST must be positive jbr L1 L2: and the replacement is decl reg jbr L2 L1: L2: acbl END,$CONST,reg,L1 or, if the loop body is short enough (<= 130 bytes, although c2 uses the improper metric of `eight instructions') and CONST is 1, the acbl is replaced with an aobleq. In neither case is the instruction byte count 25; if we count only instructions inside the loop, the acbl is always shorter, and in any case it is never longer. If we assume all branches are in bounds (which gives the best situation for the cmp/jgtr code), we get: L1: cmp END,reg # 2+k bytes, k=1 for END constant or reg bgtr L2 # 2 bytes addl2 $CONST,reg # 2+l bytes, l=1 for CONST in 0..63 brb L1 # 2 bytes L2: # total = 8+k+l (typically 10) versus: decl reg # 2 bytes, but outside loop brb L2 # 2 bytes, but outside loop L1: L2: acbl END,$CONST,reg,L1 # 1+k+l+1+2 = 4+k+l # total = 4+k+l in loop, 8+k+l overall The most common constant, however, is 1; if the branches are in bounds, we get an aobleq and the count drops to L2: aobleq reg,END,L1 # 1+1+k+1 = 3+k for 3+k bytes in the loop and 7+k bytes total; in this case l=1 and the cmp/jgtr loop takes 9+k bytes total, all in the loop. If the branches are not in range, the picture becomes much murkier, as the `jxxx' opcodes usually `explode', but sometimes `tunnel': jgtr L3 L3: can become jleq Laround # invert the branch brw L3 # and use a longer form Laround: L3: but if there happens to be a branch to L3 (either unconditional or bgtr) that is in range of the first branch, we might get instead bgtr Ltunnel # get to someone who gets us there Ltunnel:brw L3 L3: >The multiple instructions are faster (on the 780). This I do not intend to test (780s being rather uninteresting). Anyway, the real point of all this is that instruction selection, especially for baroque CISCs like a VAX, can be difficult.... -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris