Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!mit-eddie!genrad!decvax!ucbvax!sdcsvax!ucsdhub!jack!dsi480!man!sdiris1!rgs
From: rgs@sdiris1.UUCP
Newsgroups: comp.arch
Subject: Re: 64 Vs 32
Message-ID: <563@sdiris1.UUCP>
Date: Sat, 4-Apr-87 16:26:10 EST
Article-I.D.: sdiris1.563
Posted: Sat Apr 4 16:26:10 1987
Date-Received: Sun, 5-Apr-87 22:48:25 EST
References: <7844@utzoo.UUCP>
Organization: Control Data Corp.(CIM), San Diego
Lines: 122

From what I understand of this discussion, it breaks down as follows:

64-bit advantages
  o Floating point performance
  o Large, segmentable address space

32-bit advantages
  o Lower cost
  o Less memory wasted on integers

Other than in passing, however, I haven't noticed anyone mentioning raw architectural performance. It seems to me that in this area the 64-bit machine has a distinct advantage. Very few instructions are actually 64 bits in length, so multiple instructions get packed into a single word. This means that in one memory fetch (on a cache miss) up to 4 instructions are pulled into cache. This should significantly improve cache performance, I would think. Likewise, the hit rate for the data cache would improve for sequentially accessed data structures.

As it turns out, though, this has nothing to do with the CPU architecture, but with the memory-cache interface. In fact, I know of at least one 32-bit minicomputer that has a 64-bit cache-to-memory bus (Data General). It would be interesting to see what this change would do to the simulated performance of a machine (anyone with a good simulator want to try it out?).

What you would have would be a hybrid machine: one with a 32-bit CPU and a 64-bit memory bus. This would be fairly simple to build if you already have an outboard cache. I don't happen to know if the currently popular buses (like VME) would conveniently allow a 64-bit path.
A hybrid like this would give you the cost and memory/integer advantages of the 32-bit CPU, along with the performance advantage of the 64-bit machine. Additionally, the floating point hardware now has one-memory-cycle access to a 64-bit floating point value on a cache miss, assuming you put a 64-bit path between cache and the FP hardware (which you should have anyway).

Currently, even some 64-bit machines use larger memory data paths. One machine I know of that does this is the ETA (and its predecessor, the Cyber 205). This architecture has what is called a super-word, or sword for short. This is a 512- or 1024-bit memory access (at least last I heard; they keep making bigger swords to improve performance). In this case I believe it's used to improve vector pipeline performance by bringing in several vector elements on one memory access.

This does add an interesting twist to optimizing compilers. It would improve program performance to have code segments start on a sword boundary. An obvious thing would be to place all subroutine entries at a boundary. However, any major branch-to location would probably also benefit. This would mean that from 48 to 0 bits of the previous sword would be wasted (if the code falls through), so it may be worth it to define a fast "next sword" instruction to skip the 2 or 3 NOPs you'd have to do.

Data structures may also benefit from being sword aligned (especially if the structure is the same size as a single sword). In this case, 64-bit floats should be sword aligned.

The use of swords has the biggest benefit with "cache buster" types of programs. Anything that jumps all over a large data structure would probably benefit. Also, highly complex programs that rarely repeat code sequences (I know our CAD program fits this class) would also benefit. Basically, any program which currently gets a low cache hit rate, but which does reasonably sequential access to data and program, would now get at least a 50% cache hit rate.
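
The 50% figure can be checked with a toy model, sketched here in C. It rests on assumptions that are not from the post itself: 32-bit words, a buffer that holds exactly one sword, and strictly sequential reads starting on a sword boundary. The function name sequential_hit_rate and its parameters are made up for illustration.

```c
#include <stdio.h>

/*
 * Toy model (hypothetical, not a real cache): the buffer holds exactly
 * one sword, and n_words words are read in strict sequence starting on
 * a sword boundary.  Returns the fraction of reads served from the
 * buffer rather than from memory.
 */
double sequential_hit_rate(int sword_bits, int word_bits, long n_words)
{
    long words_per_sword = sword_bits / word_bits;
    long cached_sword = -1;      /* nothing loaded yet */
    long hits = 0;
    long i;

    for (i = 0; i < n_words; i++) {
        long sword = i / words_per_sword;   /* sword holding word i */
        if (sword == cached_sword)
            hits++;                         /* already in the buffer */
        else
            cached_sword = sword;           /* miss: one memory access */
    }
    return (double)hits / (double)n_words;
}
```

Under those assumptions, sequential_hit_rate(64, 32, 1024) returns 0.5, while 128-bit and 256-bit swords give 0.75 and 0.875. In general the hit rate is (k-1)/k for k words per sword.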

In these cases, the larger the sword (128 bits, 256 bits) the higher the cache hit rate. The optimal sword size is probably related to CPU speed.

Assume the following "cache buster", a vector add:

    vadd(size,a,b,c)
    int size;
    int a[],b[],c[];
    {
        while (size-- > 0)
            c[size] = a[size] + b[size];
    }

The loop (probably) compiles to something like:

            lda  D0,@size
            lda  A0,a
            lda  A1,b
            lda  A2,c
            lda  D4,"-1"
    loop:   tst  D0
            beq  fini
            add  D0,D4,D0    ; size -= 1
            add  A0,D0,A3    ; A3=A0+D0
            lda  D1,@A3
            add  A1,D0,A4    ; A4=A1+D0
            lda  D2,@A4
            add  A2,D0,A5    ; A5=A2+D0
            add  D1,D2,D3    ; D3=D1+D2
            sta  D3,@A5
            jmp  loop
    fini:   ...

This does 3 memory references for 11 instructions (this is without using any addressing modes; most CPUs will use fewer instructions). In this example, if the CPU can execute 2 instructions or more in the time it takes to access one word of external memory, it's probably going to have to wait. In this case, the larger the sword the better the cache hit rate. Coded this way (assuming 32-bit ints and a 64-bit sword), it'll do 3 memory accesses every other pass through the loop.

Recoding it as follows, with the two input vectors interleaved so that each pair of operands sits in one sword:

    vadd(size,a,c)
    int size;
    int a[][2],c[];
    {
        while (size-- > 0)
            c[size] = a[size][0] + a[size][1];
    }

really takes advantage of the sword. As soon as a[size][0] is accessed, a[size][1] is loaded into cache along with it. Coded this way there is 1 memory access on odd passes and 2 on even passes. This spreads the memory accesses out better.

The faster the CPU (and probably the larger the data structures), the larger the sword should be.

All of this is off the cuff, so I really haven't had time to actually work out any timings on paper. I really would be interested to hear if anyone has done some simulations (or has first-hand experience with this on 32-bit machines).

--
UUCP: ...!sdcsvax!jack!man!sdiris1!rgs     | Rusty Sanders
Work : +1 619 450 6518                     | Control Data Corporation (CIM)
                                           | 4455 Eastgate Mall,
Insert standard disclaimers here.          | San Diego, CA 92121