Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!husc6!seismo!decuac!cvl!mimsy!chris From: chris@mimsy.UUCP Newsgroups: comp.lang.c Subject: Re: Portable C vs Efficient C or "Cost of Portability" Message-ID: <6171@mimsy.UUCP> Date: Tue, 7-Apr-87 09:49:01 EST Article-I.D.: mimsy.6171 Posted: Tue Apr 7 09:49:01 1987 Date-Received: Fri, 10-Apr-87 04:21:10 EST References: <213@pyuxe.UUCP> <1024@epimass.UUCP> <170@osupyr.UUCP> Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742 Lines: 217 Keywords: portability, performance [This is long, but if you skip all the rest, please read the last paragraph.] In article <170@osupyr.UUCP> hrubin@osupyr.UUCP (Herman Rubin) writes: >I frequently find that reasonably good code can be portable, but rarely do I >find efficient code portable. By which, I guess, you mean that it takes different `tricks' on different machines to make code efficient, not that because the code has been tweaked for efficiency on one machine, it no longer runs at all on others. >In fact, rarely do I find C code (or any other HLL) to be efficient. I think you will have to define `efficient' first. >Assuming that the compiler allows you to put all the variables (but >not, of course, the arrays) in registers, the following fragment of >code is better on VAXen [reproduced below in expanded form] >while on the pyramid the reverse is true. Really? Let us see. First, here is the code being tested. You did not specify whether the arrays were to be on the stack or whether they were to be static (or, equivalently, global), so I tried it both ways. % cat xx.c /* variants: STACK => stack arrays (else STATIC => static): meaningful iff SUB SUB => subscript code (else ptr) */ f() { register int t; #ifdef SUB register int x; #ifdef STACK int m[4], h[4], k[4], z[4]; #else #ifndef STATIC oops! #else static int m[4], h[4], k[4], z[4]; #endif /* STATIC */ #endif /* STACK */ #else /* not SUB, so PTR */ register int *m, *h, *k, *z; #endif /* SUB */ #ifdef SUB t = m[x] ^ h[x]; k[x] = t; z[x] ^= t; x++; #else t = *m++ ^ *h++; *k++ = t; *z++ ^= t; #endif } And a Makefile to convert this to three assembly variants: C2ASM= cc -O -S all: v0.s v1.s v2.s v0.s: xx.c ${C2ASM} -DSTACK -DSUB xx.c; mv xx.s $@ v1.s: xx.c ${C2ASM} -DSTATIC -DSUB xx.c; mv xx.s $@ v2.s: xx.c ${C2ASM} -DPTR xx.c; mv xx.s $@ Now compare all three. In reverse order: /* Vax: PTR */ xorl3 (r9)+,(r10)+,r0 movl r0,r11 movl r11,(r8)+ xorl2 r11,(r7)+ Not bad. /* Pyr: PTR */ movw (lr1),lr0 xorw (lr2),lr0 addw $4,lr2 addw $4,lr1 movw lr0,(lr3) addw $4,lr3 movw (lr4),tr0 xorw lr0,tr0 movw tr0,(lr4) addw $4,lr4 Well, not so great. All those ++es expand to one instruction each. Next: /* Vax: STATIC SUB */ xorl3 L16[r10],L17[r10],r0 movl r0,r11 movl r11,L18[r10] xorl2 r11,L19[r10] incl r10 Not so bad either. In fact, this takes only one more instruction, and longer addressing modes. /* Pyr: STATIC SUB */ movw L13[lr1 * 4] ,lr0 xorw L14[lr1 * 4] ,lr0 movw lr0,L15[lr1 * 4] movw L16[lr1 * 4] ,tr0 xorw lr0,tr0 movw tr0,L16[lr1 * 4] addw $1,lr1 This is clearly better than Pyr: PTR. (The `* 4' is an addressing mode; these are indeed single instructions.) Finally: /* Vax: STACK SUB */ xorl3 -16(fp)[r10],-32(fp)[r10],r0 movl r0,r11 movl r11,-48(fp)[r10] xorl2 r11,-64(fp)[r10] incl r10 Still not so bad. A bit slower than STATIC SUB, since -offset(fp)'s must be computed each time, but offsets of -64..63 all fit in the instruction itself on a Vax. /* Pyr: STACK SUB */ mova -16(cfp),lr0 movw (lr0)[lr1 * 4] ,lr0 mova -32(cfp),tr0 xorw (tr0)[lr1 * 4] ,lr0 mova -48(cfp),tr0 movw lr0,(tr0)[lr1 * 4] mova -64(cfp),tr0 movw (tr0)[lr1 * 4] ,tr1 xorw lr0,tr1 movw tr1,(tr0)[lr1 * 4] addw $1,lr1 I do not know whether the offsets fit, but this is basically like the PTR version, although I count 11 instructions, and PTR was 10. What have we really learned? Very little. Unless this is an inner loop of an inner loop, none of these variations will make much difference. And if it *is* an inner loop of an inner loop, maybe we should use yet another variant: register int *h = h_arr, *m = m_arr, *k = k_arr, *z = z_arr; register int t, x; ... t = m[x] ^ h[x]; k[x] = t; z[x] ^= t; x++; which, on the Vax, becomes xorl3 (r9)[r6],(r11)[r6],r0 movl r0,r7 movl r7,(r10)[r6] xorl2 r7,(r8)[r6] incl r6 which should be ever so slightly faster than the static version. The `extra' incl in this loop over the pointer version may actually cost less than you think: With a loop around the outside, counting to (say) 1000, we get clrl r6 L2000001: xorl3 (r9)[r6],(r11)[r6],r7 movl r7,(r10)[r6] xorl2 r7,(r8)[r6] aoblss $1000,r6,L2000001 The add, test, and branch instruction makes this *better* than the pointer variant, which requires two separate instructions to do the loop! (It is conceivable that the address computation in this loop, being more complex than in the pointer loop, makes this version slower than the other, but I doubt it---and if you really care, go measure it yourself! :-) ) On the Pyramid, it compiles to this: movw (lr2)[lr5 * 4] ,lr4 xorw (lr0)[lr5 * 4] ,lr4 movw lr4,(lr1)[lr5 * 4] movw (lr3)[lr5 * 4] ,tr0 xorw lr4,tr0 movw tr0,(lr3)[lr5 * 4] addw $1,lr5 which is most likely slightly faster than the static version, and is clearly faster than the other versions. But (and this is an important `but') unless you *know* your program spends a great deal of its time in this loop, there is little point in attempting to optimise it in source code. That such optimisations may be machine dependent is immaterial when said optimisations are unnecessary, and where they *do* make a difference, if you expect the code to be ported, *point them out*! -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690) UUCP: seismo!mimsy!chris ARPA/CSNet: chris@mimsy.umd.edu