Xref: utzoo comp.arch:10560 comp.lang.misc:3059
Path: utzoo!attcan!uunet!dino!atanasoff!hascall
From: hascall@atanasoff.cs.iastate.edu (John Hascall)
Newsgroups: comp.arch,comp.lang.misc
Subject: Re: Programming and Machine Operations
Message-ID: <1178@atanasoff.cs.iastate.edu>
Date: 9 Jul 89 15:32:39 GMT
References: <57125@linus.UUCP> <1989Jun24.230056.27774@utzoo.uucp> <13970@haddock.ima.isc.com> <1398@l.cc.purdue.edu> <13979@haddock.ima.isc.com>
Reply-To: hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall)
Organization: Iowa State Univ. Computation Center
Lines: 73

In article <13979@haddock.ima.isc.com> suitti@haddock.ima.isc.com (Stephen Uitti) writes:
>In various articles, cik & suitti argue...
 
>>> The code sequence is: ...
>>> 
>>> void lvecadd(a, b, c, s) /* a = b + c, length s */
>>> long *a; long *b; long *c; long s;
>>> {
>>> 	do {
>>> 		*a++ = *b++ + *c++;
>>> 	} while (--s != 0);
>>> }
>For the VAX:
>L18:	addl3	(r9)+,(r10)+,r0
>	movl	r0,(r11)+
>	decl	r8
>	jneq	L18
 
>>I am afraid I will have to give you a D- on this.  Most of the time I would
>>not even bother with a call, considering the code length.  But the code is
>>bad for vectors of length >2.  How about
>>{
>>	end = a + s;
>> 	do {
>> 		*a++ = *b++ + *c++;
>> 	} while (a < end);
>>}
 
>L26:	addl3	(r9)+,(r10)+,r0
>	movl	r0,(r11)+
>	cmpl	r11,r7
>	jlss	L26
 

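   Of course, both routines misbehave if the vector length is 0 :-)
   The counted version decrements s past zero and runs off the ends of
   the arrays; the end-pointer version still executes the body once and
   stores a stray element into a[0].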

      void lvecadd(a, b, c, s)
      register long *a, *b, *c, s;         /* a[0..s-1] = b[0..s-1] + c[0..s-1] */
      {
              if (s <= 0) return;          /* never trust your callers */
              do {
                      *a++ = *b++ + *c++;
              } while (--s > 0);
      }
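   For anyone wanting to try this on a modern compiler, here is a minimal
   sketch in ANSI C prototype form.  lvecadd keeps the guarded counted
   loop above; lvecadd_end is a hypothetical name for the quoted
   end-pointer loop rewritten to test at the top, so a length of 0 does
   no work without a separate guard.  The const qualifiers and the size_t
   length are assumptions of this sketch, not part of the original code.

```c
#include <stddef.h>

/* a[0..s-1] = b[0..s-1] + c[0..s-1]; does nothing when s <= 0.
 * Same counted do/while loop as the guarded K&R routine. */
void lvecadd(long *a, const long *b, const long *c, long s)
{
    if (s <= 0)                 /* never trust your callers */
        return;
    do {
        *a++ = *b++ + *c++;
    } while (--s > 0);
}

/* Hypothetical end-pointer variant: the loop test comes first, so
 * n == 0 makes end == a and the body never runs.  Using size_t for
 * the length (an assumption) rules out negative counts, which would
 * otherwise put end before a. */
void lvecadd_end(long *a, const long *b, const long *c, size_t n)
{
    long *end = a + n;

    while (a < end)
        *a++ = *b++ + *c++;
}
```

   Calling either routine with a length of 0 leaves a untouched; with
   length 4, b = {1,2,3,4} and c = {10,20,30,40}, a becomes {11,22,33,44}.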

   giving (using VAX C V2.4-026 [VMS]):

                        0000  lvecadd:
                  007C  0000          .entry  lvecadd,^m<r2,r3,r4,r5,r6>
            5E 04 C2    0002          subl2   #4,sp
         53 04 AC D0    0005          movl    4(ap),r3
         52 08 AC D0    0009          movl    8(ap),r2
         51 0C AC D0    000D          movl    12(ap),r1
         50 10 AC D0    0011          movl    16(ap),r0
               01 14    0015          bgtr    sym.1
                  04    0017          ret
                        0018  sym.1:
         83 81 82 C1    0018          addl3   (r2)+,(r1)+,(r3)+
               50 D7    001C          decl    r0
               F8 14    001E          bgtr    sym.1

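   I'm not sure why "sobgtr r0,sym.1" wasn't used for the last two
   instructions; SOBGTR subtracts one and branches if the result is
   greater than zero, which is exactly what "decl r0 / bgtr sym.1" does.
   Is the two-instruction sequence faster?  Does the optimizer not
   recognize the decrement/branch pattern?  (Maybe someone with a newer
   version of the compiler could try it.)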

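   Why are r4, r5, and r6 being saved when the routine only touches r0
   through r3?  And what the devil is "subl2 #4,sp" for, when nothing
   ever uses the stack space it reserves?  Can't the optimizer clean up
   after itself?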

   At least it got the important part (the loop) right: by using the
   autoincrement mode (rn)+ for all three operands, it folds the whole
   assignment into a single addl3.

   John Hascall