Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!husc6!seismo!decuac!cvl!mimsy!chris
From: chris@mimsy.UUCP
Newsgroups: comp.lang.c
Subject: Re: Portable C vs Efficient C or "Cost of Portability"
Message-ID: <6171@mimsy.UUCP>
Date: Tue, 7-Apr-87 09:49:01 EST
Article-I.D.: mimsy.6171
Posted: Tue Apr  7 09:49:01 1987
Date-Received: Fri, 10-Apr-87 04:21:10 EST
References: <213@pyuxe.UUCP> <1024@epimass.UUCP> <170@osupyr.UUCP>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 217
Keywords: portability, performance

[This is long, but if you skip all the rest, please read the last
paragraph.]

In article <170@osupyr.UUCP> hrubin@osupyr.UUCP (Herman Rubin) writes:
>I frequently find that reasonably good code can be portable, but rarely do I
>find efficient code portable.

By which, I guess, you mean that it takes different `tricks' on
different machines to make code efficient, not that because the code
has been tweaked for efficiency on one machine, it no longer runs at
all on others.

>In fact, rarely do I find C code (or any other HLL) to be efficient.

I think you will have to define `efficient' first.

>Assuming that the compiler allows you to put all the variables (but
>not, of course, the arrays) in registers, the following fragment of
>code is better on VAXen
[reproduced below in expanded form]
>while on the pyramid the reverse is true.

Really?  Let us see.

First, here is the code being tested.  You did not specify whether the
arrays were to be on the stack or whether they were to be static (or,
equivalently, global), so I tried it both ways.

	% cat xx.c
	/* variants:
		STACK => stack arrays (else STATIC => static): meaningful
			iff SUB
		SUB => subscript code (else ptr)
	*/

	f()
	{
		register int t;
	#ifdef SUB
		register int x;
	#ifdef STACK
		int m[4], h[4], k[4], z[4];
	#else
	#ifndef STATIC
		oops!
	#else
		static int m[4], h[4], k[4], z[4];
	#endif /* STATIC */
	#endif /* STACK */
	#else /* not SUB, so PTR */
		register int *m, *h, *k, *z;
	#endif /* SUB */

	#ifdef SUB
		t = m[x] ^ h[x];
		k[x] = t;
		z[x] ^= t;
		x++;
	#else
		t = *m++ ^ *h++;
		*k++ = t;
		*z++ ^= t;
	#endif
	}

And a Makefile to convert this to three assembly variants:

	C2ASM=	cc -O -S

	all: v0.s v1.s v2.s

	v0.s: xx.c
		${C2ASM} -DSTACK -DSUB xx.c; mv xx.s $@

	v1.s: xx.c
		${C2ASM} -DSTATIC -DSUB xx.c; mv xx.s $@

	v2.s: xx.c
		${C2ASM} -DPTR xx.c; mv xx.s $@

Now compare all three.  In reverse order:

	/* Vax: PTR */
	xorl3	(r9)+,(r10)+,r0
	movl	r0,r11
	movl	r11,(r8)+
	xorl2	r11,(r7)+

Not bad.

	/* Pyr: PTR */
	movw	(lr1),lr0
	xorw	(lr2),lr0
	addw	$4,lr2
	addw	$4,lr1
	movw	lr0,(lr3)
	addw	$4,lr3
	movw	(lr4),tr0
	xorw	lr0,tr0
	movw	tr0,(lr4)
	addw	$4,lr4

Well, not so great.  All those ++es expand to one instruction each.

Next:
	/* Vax: STATIC SUB */
	xorl3	L16[r10],L17[r10],r0
	movl	r0,r11
	movl	r11,L18[r10]
	xorl2	r11,L19[r10]
	incl	r10

Not so bad either.  In fact, this takes only one more instruction,
and longer addressing modes.

	/* Pyr: STATIC SUB */
	movw	L13[lr1 * 4] ,lr0
	xorw	L14[lr1 * 4] ,lr0
	movw	lr0,L15[lr1 * 4] 
	movw	L16[lr1 * 4] ,tr0
	xorw	lr0,tr0
	movw	tr0,L16[lr1 * 4] 
	addw	$1,lr1

This is clearly better than Pyr: PTR.  (The `* 4' is an addressing
mode; these are indeed single instructions.)

Finally:
	/* Vax: STACK SUB */
	xorl3	-16(fp)[r10],-32(fp)[r10],r0
	movl	r0,r11
	movl	r11,-48(fp)[r10]
	xorl2	r11,-64(fp)[r10]
	incl	r10

Still not so bad.  A bit slower than STATIC SUB, since -offset(fp)'s
must be computed each time, but offsets of -64..63 all fit in the
instruction itself on a Vax.

	/* Pyr: STACK SUB */
	mova	-16(cfp),lr0
	movw	(lr0)[lr1 * 4] ,lr0
	mova	-32(cfp),tr0
	xorw	(tr0)[lr1 * 4] ,lr0
	mova	-48(cfp),tr0
	movw	lr0,(tr0)[lr1 * 4] 
	mova	-64(cfp),tr0
	movw	(tr0)[lr1 * 4] ,tr1
	xorw	lr0,tr1
	movw	tr1,(tr0)[lr1 * 4] 
	addw	$1,lr1

I do not know whether the offsets fit, but this is basically like
the PTR version, although I count 11 instructions, and PTR was 10.

What have we really learned?  Very little.  Unless this is an inner
loop of an inner loop, none of these variations will make much
difference.  And if it *is* an inner loop of an inner loop, maybe
we should use yet another variant:

	register int *h = h_arr, *m = m_arr, *k = k_arr, *z = z_arr;
	register int t, x;
	...
	t = m[x] ^ h[x];
	k[x] = t;
	z[x] ^= t;
	x++;

which, on the Vax, becomes

	xorl3	(r9)[r6],(r11)[r6],r0
	movl	r0,r7
	movl	r7,(r10)[r6]
	xorl2	r7,(r8)[r6]
	incl	r6

which should be ever so slightly faster than the static version.
The `extra' incl in this loop over the pointer version may actually
cost less than you think:  With a loop around the outside, counting
to (say) 1000, we get

	clrl	r6
L2000001:
	xorl3	(r9)[r6],(r11)[r6],r7
	movl	r7,(r10)[r6]
	xorl2	r7,(r8)[r6]
	aoblss	$1000,r6,L2000001

The add, test, and branch instruction makes this *better* than the
pointer variant, which requires two separate instructions to do
the loop!  (It is conceivable that the address computation in this
loop, being more complex than in the pointer loop, makes this
version slower than the other, but I doubt it---and if you really
care, go measure it yourself! :-) )

On the Pyramid, it compiles to this:

	movw	(lr2)[lr5 * 4] ,lr4
	xorw	(lr0)[lr5 * 4] ,lr4
	movw	lr4,(lr1)[lr5 * 4] 
	movw	(lr3)[lr5 * 4] ,tr0
	xorw	lr4,tr0
	movw	tr0,(lr3)[lr5 * 4] 
	addw	$1,lr5

which is most likely slightly faster than the static version, and
is clearly faster than the other versions.

But (and this is an important `but') unless you *know* your program
spends a great deal of its time in this loop, there is little point
in attempting to optimise it in source code.  That such optimisations
may be machine dependent is immaterial when said optimisations are
unnecessary, and where they *do* make a difference, if you expect
the code to be ported, *point them out*!
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris	ARPA/CSNet:	chris@mimsy.umd.edu