Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!wuarchive!mit-eddie!uw-beaver!tera.com!david From: david@tera.com (David Callahan) Newsgroups: comp.arch Subject: Re: Shared Memory (was Terradata architecures) Message-ID: <1990Oct9.204220.18962@tera.com> Date: 9 Oct 90 20:42:20 GMT References: <1990Oct3.013509.1470@news.iastate.edu> <10651@pt.cs.cmu.edu> <11182@life.ai.mit.edu> Sender: news@tera.com Organization: Tera Computer Company Lines: 144 I've seen three basic issues raised in the discussion of shared memory multiprocessors: 1. They are/are not scalable. 2. There is a performance advantage to having the programmer explicitly manage data placement. 3. There is a hardware cost advantage in not implementing shared-memory but a software cost disadvantage. I'd like to discuss the first two; the third I believe is obvious but the net cost advantage is unknown and subject to market forces and budget pressures. This discussion does not seem limited to comp.arch. On comp.parallel there was an advance program for a workshop on ``Unstructured Scientific Computation on Scalable Multiprocessors'' where it is clear that ``scalable'' is being used as a code word for non-shared memory and in the current issue of CACM in a side bar to the article by Gul Agha there is a simple assertion that shared-memory multiprocessors are not scalable with a reference to work by Dally. What is shared-memory? An innocuous looking question but one which is hard to answer since (given points 2 and 3) we are really interested in the programming issue. At that level we might have this property: SM1. Shared-memory implies that any thread of control can access any word of memory without the explicit assistance of another thread. This definition is meant to exclude traditional message-passing programming models which are the rule in production hypercube machines but not to exclude experiments with other language paradigms such as Linda or Prolog, or the various attempts to compile shared memory programs onto message-passing systems. Clearly shared-memory could be implemented on essentially any hardware by having the language implementation provide server threads to service read and write requests. Or that capability can be directly implemented in hardware --- which is the usual meaning of shared memory in this group. This definition does not capture the programming costs inherent when ``remote'' memory is much more costly to access than ``close'' memory and so we need an additional constraint: SM2. Shared-memory implies that, with probability 1, the placement of data does not influence performance by more than a factor of 2. The clause ``with probability 1'' means that bank conflicts are unlikely (hot spots can still be a problem). The factor of 2 is obviously arbitrary but 2 seems small enough that its not worth programming around. This constraint was first described to me by Burton Smith and was influenced by Leslie Valiant (everyone should read his recent CACM article). Hot spots in an algorithm are a problem for non-shared memory systems as well and so they simply must be programmed around. What is scalability? This is another slippery term made difficult because of different assumptions about what is scaling. In particular, if I double the number of processors I am willing to run a bigger problem to get twice my current performance (no way around Amdahl's Law) but I don't want to reprogram. (Actually, most programs won't scale arbitrarily since new performance bottlenecks arise in the algorithm unless its designed with extreme care but that is architecture independent.) This gives rise to the definitions: S1. A system architecture is scalable if the deliverable performance is a linear function of the number of processors. S2. A shared-memory system is scalable if it is scalable and satisfies SM2 for any number of processors. A basic question is: can you design a scalable, shared-memory architecture? I'll answer this question by giving an example. Start with a packet-switch interconnection network. This network consists of a NxNxN 3d mesh, toroidally connected in all dimensions. Each dimension is ``folded over'' so that there are no long wires (and so satisfies S1). Now uniformly embed in this network P=NxN processors and P memory modules. This basic model satisfies SM1 and we must argue that it can also satisfy SM2 therefore S2. Note that hypercubes and meshes with dimension greater than 3 will not scale because of long wires unless corners and edges are made visible to the programmer. We will assume that memory references are uniformly distributed; the processors will hash virtual addresses to get physical addresses (as in RP3 and Cydrome) and so this assumption is valid except for hot spots. The average distance in the network is 3*(N/4) = sqrt(P)*3/64 and we define the latency, L, of the network to be the round trip time through the network plus memory chips. The network is larger than the number of processors to provide adequate bisection bandwidth. If each processor only issues one memory operation and then waits for it to complete, the system could not scale since the performance would degrade as the latency increased. In fact the processor must be able to maintain L memory operations in concurrent execution where L grows with sqrt(P). One way to do this is to have a vector processor with vector length L. Unfortunately this vector length is exposed to the programmer and restricts the general applicability of the system. An alternative is to use a data-flow style processor, which has the disadvantage that registers are hard to use effectively (though variants are currently being studied). A compromise is to have each processor support L independent instruction streams. Each stream has its own program counter and register set but all streams share the same set of functional units and the same memory port via low-level time multiplexing. These processors are generally referred to as ``multi-threaded'' processors. They have the advantage of supporting general parallelism but also have a register set to support compile-time optimizations. In all of these architectures, not all of the parallelism in a program is translated into ``speedup''. Rather some of it is ``spent'' to support shared memory (and also functional unit pipeline latencies). This parallelism is what Valiant refers to as ``parallel slackness''. For this architecture, if we double the number of processors, then the latency increases by no more than sqrt(2) (about 1.4) and so the number of streams in an running program should increase by twice this in order to realize a doubling in delivered performance. For a problem of fixed size, you will eventually run out of parallelism and the only way to improve performance is to reduce memory latency but this is merely Amdahl's Law. To reduce memory latency for this extreme case we can add local memory to each processor and expose its presence to the programmer so he can leave the shared-memory world when his problem is tight on parallelism and he absolutely must have the extra performance. This machine is essentially the product being developed by us at Tera Computer Co. We can make this system less effective at providing shared-memory by reducing the size of the network to NxNxK where K < N but leaving the number of processors at NxN. As K decreases, the system moves toward as 2D mesh-connected system which does not have the bandwidth to support shared-memory and the programmer will have to use local memory more aggressively to sustain the same level of performance. Since the network is a major cost in the system, we see that giving up shared memory will reduce hardware cost. Does it raise software costs by more? -- David Callahan david@tera.com Tera Computer Co. 400 North 34th Street, Suite 300 Seattle WA, 98103