Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!seismo!mimsy!chris
From: chris@mimsy.UUCP
Newsgroups: comp.arch
Subject: Re: 64 Vs 32
Message-ID: <6149@mimsy.UUCP>
Date: Sun, 5-Apr-87 14:25:46 EST
Article-I.D.: mimsy.6149
Posted: Sun Apr  5 14:25:46 1987
Date-Received: Sun, 5-Apr-87 23:39:01 EST
References: <7844@utzoo.UUCP> <563@sdiris1.UUCP>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 78

In article <563@sdiris1.UUCP> rgs@sdiris1.UUCP (Rusty Sanders) writes:
>... In fact, I know of at least one 32-bit mini computer that has
>a 64-bit cache to memory bus (Data General).

The VAX-11/780 has a 64-bit backplane (the SBI) between its cache
and its memory.

>This does add an interesting twist to optimizing compilers. It would
>improve program performance to have code segments start on a [superword]
>boundary.  An obvious thing would be to place all subroutine entries
>at a boundary.

The Unix VAX assembler has a `.align' directive for such purposes,
but the compiler emits only `.align 1', which aligns to 2**1 bytes,
i.e., 16-bit boundaries.  This is probably because the first thing in
each routine is a 16-bit word containing a register save mask (and a
few other bits that are essentially never set anyway).
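
To make the arithmetic concrete, here is a minimal sketch in C of the
rounding that `.align n' performs as described above (round up to the
next multiple of 2**n bytes).  The name align_up is made up for
illustration; it is not part of the assembler or of any library.

```c
#include <stddef.h>

/*
 * Round addr up to the next multiple of 2**n bytes.
 * align_up is a hypothetical helper, named here only for illustration.
 */
size_t align_up(size_t addr, unsigned n)
{
	size_t mask = ((size_t)1 << n) - 1;	/* low n bits set */

	return (addr + mask) & ~mask;
}
```

Under this reading, align_up(5, 1) is 6 (a 16-bit boundary), and
matching a 64-bit path would take n = 3, since 2**3 bytes is 64 bits:
align_up(9, 3) is 16.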

>The use of swords has the biggest benefit with "cache buster" types of
>programs.

... provided such programs were written carefully.  Similar to the
`cache buster' is the `VM buster': a program that walks
multidimensional arrays with the subscript that should vary fastest
(the one whose elements are adjacent in memory) varying slowest.
(If that does not make sense, there is an example below.)

>... Recoding as follows:
>vadd(size,a,b)
>   int size;
>   int a[2][],c[];
>{
>   while (--size)
>      c[size] = a[0][size] + a[1][size];
>}

This looks like the coder was `thinking FORTRAN and writing C', which
is often a performance disaster (as is `thinking C and writing FORTRAN').
Aside from the nits:

>As soon as a[0][size] is accessed, a[1] is loaded into cache.

Since the last subscript varies fastest in C, as soon as a[0][size]
is accessed, it is the neighbouring a[0][size+1] (or a[0][size-1],
depending on alignment) that is cached, not a[1][size].
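
The layout claim is easy to check.  The following is a small sketch;
the row length NCOLS, the array demo, and the function byte_distance
are all invented for the illustration.

```c
#include <stddef.h>

#define NCOLS 100		/* hypothetical row length */

int demo[2][NCOLS];		/* hypothetical two-row array */

/*
 * Distance in bytes from a[i0][j0] to a[i1][j1].
 * Both elements must lie inside the same array object.
 */
ptrdiff_t byte_distance(int (*a)[NCOLS], int i0, int j0, int i1, int j1)
{
	return (char *)&a[i1][j1] - (char *)&a[i0][j0];
}
```

Here byte_distance(demo, 0, 5, 0, 6) is sizeof(int), so the two
elements are neighbours, while byte_distance(demo, 0, 5, 1, 5) is
NCOLS * sizeof(int), a whole row away.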

A C matrix add loop should read

	for (i = 0; i < size1; i++)
		for (j = 0; j < size2; j++)
			c[i][j] = a[i][j] + b[i][j];
	/* and of course you can optimise with */
	/* pointers, if it really comes to that. */

while the Ratfor loop should read

	for (j = 1; j <= size; j = j + 1)
		for (i = 1; i <= size; i = i + 1)
			c(i, j) = a(i, j) + b(i, j)

(FORTRAN subscripts start at one by default, and there the first
subscript varies fastest, hence the swapped loop order).  Reversing
the loops in either language can have terrible effects on performance,
due to cache effects (as described above) and due to `unexpected' VM
behaviour (the scattered array references cause excessive page faults).
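
A rough way to see the size of the effect is to count how often
successive references land on a different page.  The sketch below does
only that; it is a crude proxy, not a page fault count, since real
faults depend on the working set and the replacement policy.  The
function name page_changes and all of its parameters are assumptions
made for the illustration.

```c
#include <stddef.h>

/*
 * Walk an nrows x ncols row-major (C layout) array of elem_size-byte
 * elements and count how many references fall on a different page
 * from the previous reference.  If row_order is nonzero the last
 * subscript varies fastest (the C order above); otherwise the first
 * subscript varies fastest (the loops reversed).
 */
unsigned long page_changes(size_t nrows, size_t ncols, size_t elem_size,
    size_t page_size, int row_order)
{
	size_t outer_n = row_order ? nrows : ncols;
	size_t inner_n = row_order ? ncols : nrows;
	size_t outer, inner, i, j, page;
	size_t last = (size_t)-1;	/* no page referenced yet */
	unsigned long changes = 0;

	for (outer = 0; outer < outer_n; outer++)
		for (inner = 0; inner < inner_n; inner++) {
			i = row_order ? outer : inner;
			j = row_order ? inner : outer;
			page = (i * ncols + j) * elem_size / page_size;
			if (page != last) {
				changes++;
				last = page;
			}
		}
	return changes;
}
```

For a 512 by 512 array of 4-byte ints and 512-byte pages (example
figures only), page_changes(512, 512, 4, 512, 1) is 2048, one per page
of the array, while page_changes(512, 512, 4, 512, 0) is 262144: every
single reference moves to another page.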

(For the nit-p..., er, record, most likely what was meant was

	vadd(size, a, c)
		register int size;
		register int a[][2], c[];
	/* or	register int (*a)[2], *c; */
	{
		while (--size >= 0)
			c[size] = a[size][0] + a[size][1];
	}
)
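
The same routine in prototype-style C, as a sketch for anyone who
wants to compile and test it.  The `register' hints are dropped and
the pointer form of the declaration is used; otherwise it is meant to
behave exactly like the version above.  Note that the test is
`--size >= 0', so that element 0 is included.

```c
/*
 * c[k] = a[k][0] + a[k][1] for k = size-1 down to 0.
 * Each pair to be added is adjacent in memory.
 */
void vadd(int size, int (*a)[2], int *c)
{
	while (--size >= 0)
		c[size] = a[size][0] + a[size][1];
}
```

Given int a[3][2] = {{1, 2}, {3, 4}, {5, 6}} and an int c[3], the call
vadd(3, a, c) leaves 3, 7 and 11 in c.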
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris	ARPA/CSNet:	chris@mimsy.umd.edu