Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!mit-eddie!genrad!decvax!ucbvax!sdcsvax!ucsdhub!jack!dsi480!man!sdiris1!rgs
From: rgs@sdiris1.UUCP
Newsgroups: comp.arch
Subject: Re: 64 Vs 32
Message-ID: <563@sdiris1.UUCP>
Date: Sat, 4-Apr-87 16:26:10 EST
Article-I.D.: sdiris1.563
Posted: Sat Apr  4 16:26:10 1987
Date-Received: Sun, 5-Apr-87 22:48:25 EST
References: <7844@utzoo.UUCP>
Organization: Control Data Corp.(CIM), San Diego
Lines: 122

From what I understand of this discussion, it breaks down as follows:

64-bit advantages
o  Floating Point performance
o  Large, segmentable address space

32-bit advantages
o  Lower cost
o  Less memory wasted on integers

Other than in passing, however, I haven't noticed anyone mentioning
raw architectural performance. It seems to me that in this area the
64-bit machine has a distinct advantage.

Very few instructions are actually 64 bits in length, so multiple
instructions get packed into a single word. This means that in one memory
fetch (on a cache miss) up to 4 instructions are pulled into cache. This
should significantly improve cache performance, I would think. Likewise,
the hit rate for data cache would improve for sequentially accessed
data structures.

As it turns out, though, this depends not on the CPU architecture
but on the memory-cache interface. In fact, I know of at least one
32-bit minicomputer that has a 64-bit cache-to-memory bus (Data General).

It would be interesting to see what this change would do to the simulated
performance of a machine (anyone with a good simulator want to try it out?).
What you would have would be a hybrid machine: one with a 32-bit CPU
and a 64-bit memory bus. This would be fairly simple to build if you
already have an outboard cache. I don't happen to know if the currently
popular busses (like VME) would conveniently allow a 64-bit path.

This would give you the cost and memory/integer advantage of the 32-bit CPU,
along with the performance advantage of the 64-bit machine. Additionally,
the floating point hardware now has 1 memory cycle access to a 64-bit
floating point on a cache miss, assuming you put a 64-bit path between
cache and the FP hardware (which you should have anyway).
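
To put a number on the "1 memory cycle" point, here is a minimal sketch
that counts the bus transfers needed to move one operand across a path of
a given width. The function name bus_transfers is made up for illustration,
and it assumes the operand is aligned to the bus width.

```c
#include <assert.h>

/* Number of bus transfers needed to move one operand of operand_bits
 * across a data path that is bus_bits wide (rounds up).
 * Assumption: the operand is aligned, so it never straddles an extra word. */
unsigned bus_transfers(unsigned operand_bits, unsigned bus_bits)
{
    assert(bus_bits > 0);
    return (operand_bits + bus_bits - 1) / bus_bits;
}
```

Under that assumption a 64-bit float takes bus_transfers(64, 32) = 2
transfers on a 32-bit path and bus_transfers(64, 64) = 1 on a 64-bit path.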

Currently, even some 64-bit machines use larger memory data paths. One
machine I know which does this is the ETA (and its predecessor, the
Cyber 205). This architecture has what is called a Super-word, or
sword for short. This is a 512 or 1024 bit memory access (at least last I
heard, they keep making bigger swords to improve performance). In this
case I believe it's used to improve the vector pipeline performance by
bringing in several vector elements on a memory access.

This does add an interesting twist to optimizing compilers. It would
improve program performance to have code segments start on a sword boundary.
An obvious thing would be to place all subroutine entries at a boundary.
However, any major branch-to location would probably also benefit. This
would mean that from 48 to 0 bits of the previous sword would be wasted (if
the code falls thru), so it may be worth it to define a fast "next sword"
instruction to skip the 2 or 3 NOPs you'd have to do. Data structures may
also benefit by being sword aligned (especially if the structure is the
same size as a single sword). In this case, 64-bit floats should be sword
same size as a single sword). In this case, 64-bit floats should be sword
aligned.
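
The padding arithmetic above can be sketched roughly as follows. The
function names are hypothetical, and the 16-bit NOP in the usage note is
an assumption (it matches the "up to 4 instructions" per 64-bit word
mentioned earlier, not any particular machine).

```c
#include <assert.h>

/* Bits of padding needed so the next item starts on a sword boundary.
 * offset_bits is where the previous code ended, counted from a boundary. */
unsigned pad_to_sword(unsigned long offset_bits, unsigned sword_bits)
{
    unsigned rem;
    assert(sword_bits > 0);
    rem = (unsigned)(offset_bits % sword_bits);
    return rem == 0 ? 0 : sword_bits - rem;
}

/* Number of NOPs, each nop_bits long, needed to fill that padding.
 * Assumption: code offsets are always a multiple of nop_bits. */
unsigned nops_to_sword(unsigned long offset_bits, unsigned sword_bits,
                       unsigned nop_bits)
{
    unsigned pad = pad_to_sword(offset_bits, sword_bits);
    assert(nop_bits > 0 && pad % nop_bits == 0);
    return pad / nop_bits;
}
```

With a 64-bit sword and 16-bit NOPs, code ending 16 bits into a sword
wastes pad_to_sword(16, 64) = 48 bits, which is 3 NOPs; code ending right
on a boundary wastes nothing.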

The use of swords has the biggest benefit with "cache buster" types of
programs. Anything that jumps all over a large data structure would
probably benefit. Also, highly complex programs that repeat code sequences
rarely (I know our CAD program fits this class) would also benefit.
Basically, any program which gets a low cache hit rate currently, but which
does reasonably sequential access to data and program, would now get at
least a 50% cache hit rate. In these cases, the larger the sword (128 bits,
256 bits) the higher the cache hit rate.
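
As a rough illustration of where the 50% figure comes from, here is a
sketch of the hit rate for a purely sequential walk that would otherwise
miss on every access. The function name is made up, and the model is an
assumption: one miss brings in a whole sword, every other item in that
sword then hits, and nothing is evicted early.

```c
#include <assert.h>

/* Hit rate, in percent, for a strictly sequential walk over items of
 * item_bits each when every miss fetches a whole sword of sword_bits.
 * Assumption: items are aligned and pack evenly into a sword. */
double seq_hit_percent(unsigned item_bits, unsigned sword_bits)
{
    unsigned per_sword;
    assert(item_bits > 0 && sword_bits >= item_bits);
    assert(sword_bits % item_bits == 0);
    per_sword = sword_bits / item_bits;
    /* one miss per sword, the remaining per_sword - 1 accesses hit */
    return 100.0 * (per_sword - 1) / per_sword;
}
```

For 32-bit items this gives 50% with a 64-bit sword, 75% with 128 bits,
and 87.5% with 256 bits.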

The optimal sword size is probably related to CPU speed. Assume the following
"cache buster", a vector add:
vadd(size,a,b,c)
   int size;
   int a[],b[],c[];
{
   while (size--)       /* indexes size-1 down to 0, so element 0 is included */
      c[size] = a[size] + b[size];
}
The loop (probably) compiles to something like:
       lda  D0,@size
       lda  A0,a
       lda  A1,b
       lda  A2,c
       lda  D4,"-1"
loop:  tst  D0
       beq  fini        ; done once size reaches 0
       add  D0,D4,D0    ; size -= 1
       add  A0,D0,A3    ; A3=A0+D0
       lda  D1,@A3
       add  A1,D0,A4    ; A4=A1+D0
       lda  D2,@A4
       add  A2,D0,A5    ; A5=A2+D0
       add  D1,D2,D3    ; D3=D1+D2
       sta  D3,@A5
       jmp  loop
fini:  ...
This does 3 memory references for 11 instructions (this uses no
addressing modes; most CPUs will need fewer instructions). In this example,
if the CPU can execute 2 instructions or more in the time it takes to
access one word of external memory, it's probably going to have to wait.
In this case, the larger the sword the better the cache hit rate. Coded
this way it'll do 3 memory accesses every other pass through the loop.
Recoding as follows:
vadd(size,a,c)
   int size;
   int a[][2],c[];
{
   while (size--)
      c[size] = a[size][0] + a[size][1];
}
really takes advantage of the sword. As soon as a[size][0] is accessed,
a[size][1] (which sits right next to it in memory) is loaded into cache.
Coded this way there is 1 memory access on odd passes and 2 on even passes.
This spreads the memory accesses out better. The faster the CPU (and
probably the larger the data structures), the larger the sword should be.
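
The per-pass counts for the two codings can be checked with a small
back-of-the-envelope model. This is only a sketch under stated
assumptions, not a real cache simulation: a 64-bit sword holding two
32-bit ints, every array starting on a sword boundary, the cache keeping
the most recent sword of each array, and a store miss counting as one
memory access. All names here are hypothetical.

```c
#include <assert.h>

#define WORDS_PER_SWORD 2   /* assumption: 64-bit sword, 32-bit ints */

struct fetch_stats {
    int total;       /* sword fetches over the whole loop */
    int worst_pass;  /* most fetches needed in any single pass */
};

/* Touch word index addr within one array; returns 1 if that word lies in
 * a sword other than the one most recently fetched for this array. */
static int touch(long addr, long *last_sword)
{
    long sword = addr / WORDS_PER_SWORD;
    if (sword == *last_sword)
        return 0;
    *last_sword = sword;
    return 1;
}

/* Three separate arrays: c[i] = a[i] + b[i]. */
struct fetch_stats separate_layout(int size)
{
    struct fetch_stats s = {0, 0};
    long la = -1, lb = -1, lc = -1;
    int i, n;
    assert(size >= 0);
    for (i = size - 1; i >= 0; i--) {
        n = touch(i, &la);
        n += touch(i, &lb);
        n += touch(i, &lc);
        s.total += n;
        if (n > s.worst_pass)
            s.worst_pass = n;
    }
    return s;
}

/* Operands stored side by side: c[i] = a[i][0] + a[i][1]. */
struct fetch_stats interleaved_layout(int size)
{
    struct fetch_stats s = {0, 0};
    long la = -1, lc = -1;
    int i, n;
    assert(size >= 0);
    for (i = size - 1; i >= 0; i--) {
        n = touch(2L * i, &la);       /* a[i][0] */
        n += touch(2L * i + 1, &la);  /* a[i][1], same sword as a[i][0] */
        n += touch(i, &lc);
        s.total += n;
        if (n > s.worst_pass)
            s.worst_pass = n;
    }
    return s;
}
```

In this model both layouts fetch the same number of swords in total (12
for size 8), but the separate arrays need 3 fetches in their worst pass
while the interleaved layout never needs more than 2, which is the
"spreading out" effect described above.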

All of this is off the cuff, so I really haven't had time to actually work
out some timings on paper. I really would be interested to hear if anyone
has done some simulations (or has first hand experience with this on
32 bit machines).
-- 
UUCP: ...!sdcsvax!jack!man!sdiris1!rgs |  Rusty Sanders
Work : +1 619 450 6518                 |  Control Data Corporation (CIM)
                                       |  4455 Eastgate Mall, 
Insert standard disclaimers here.      |  San Diego, CA  92121