Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!wuarchive!mit-eddie!uw-beaver!tera.com!david
From: david@tera.com (David Callahan)
Newsgroups: comp.arch
Subject: Re: Shared Memory (was Terradata architecures)
Message-ID: <1990Oct9.204220.18962@tera.com>
Date: 9 Oct 90 20:42:20 GMT
References: <1990Oct3.013509.1470@news.iastate.edu> <10651@pt.cs.cmu.edu> <11182@life.ai.mit.edu>
Sender: news@tera.com
Organization: Tera Computer Company
Lines: 144

I've seen three basic issues raised in the discussion
of shared memory multiprocessors:

1. They are/are not scalable.

2. There is a performance advantage to having the programmer
explicitly manage data placement.

3. There is a hardware cost advantage in not implementing
shared-memory but a software cost disadvantage.

I'd like to discuss the first two; the third I believe is 
obvious but the net cost advantage is unknown and subject to 
market forces and budget pressures. This discussion does not seem
limited to comp.arch. On comp.parallel there was an advance program
for a workshop on ``Unstructured Scientific Computation on 
Scalable Multiprocessors'' where it is clear that ``scalable'' is
being used as a code word for non-shared memory and in the current
issue of CACM in a side bar to the article by Gul Agha there is a
simple assertion that shared-memory multiprocessors are not scalable
with a reference to work by Dally.

What is shared-memory? An innocuous looking question but one which is
hard to answer since (given points 2 and 3) we are really interested
in the programming issue. At that level we might have this property:

SM1. Shared-memory implies that any thread of control can access any
word of memory without the explicit assistance of another thread.

This definition is meant to exclude traditional message-passing
programming models which are the rule in production hypercube machines
but not to exclude experiments with other language paradigms such as
Linda or Prolog, or the various attempts to compile shared memory programs onto
message-passing systems. Clearly shared-memory could be implemented on
essentially any hardware by having the language implementation provide
server threads to service read and write requests. Or that capability
can be directly implemented in hardware --- which is the usual meaning
of shared memory in this group.

This definition does not capture the programming costs inherent when 
``remote'' memory is much more costly to access than ``close'' memory
and so we need an additional constraint:

SM2. Shared-memory implies that, with probability 1, the placement of
data does not influence performance by more than a factor of 2.

The clause ``with probability 1'' means that bank conflicts are
unlikely (hot spots can still be a problem). The factor of 2 is
obviously arbitrary but 2 seems small enough that its not worth
programming around.  This constraint was first described to me by
Burton Smith and was influenced by Leslie Valiant (everyone should
read his recent CACM article).  Hot spots in an algorithm are a
problem for non-shared memory systems as well and so they simply must
be programmed around.

What is scalability? This is another slippery term made difficult
because of different assumptions about what is scaling. In particular,
if I double the number of processors I am willing to run a bigger
problem to get twice my current performance (no way around Amdahl's
Law) but I don't want to reprogram. (Actually, most programs won't
scale arbitrarily since new performance bottlenecks arise in the
algorithm unless its designed with extreme care but that is 
architecture independent.) This gives rise to the definitions:

S1. A system architecture is scalable if the deliverable performance
is a linear function of the number of processors.

S2. A shared-memory system is scalable if it is scalable and
satisfies SM2 for any number of processors.

A basic question is: can you design a scalable, shared-memory
architecture? I'll answer this question by giving an example.  Start
with a packet-switch interconnection network. This network consists of
a NxNxN 3d mesh, toroidally connected in all dimensions.  Each
dimension is ``folded over'' so that there are no long wires (and so
satisfies S1).  Now uniformly embed in this network P=NxN processors
and P memory modules.  This basic model satisfies SM1 and we must
argue that it can also satisfy SM2 therefore S2. Note that hypercubes
and meshes with dimension greater than 3 will not scale because of
long wires unless corners and edges are made visible to the
programmer.

We will assume that memory references are uniformly distributed; the
processors will hash virtual addresses to get physical
addresses (as in RP3 and Cydrome) and so this assumption is valid except
for hot spots. The average distance in the network is 
	3*(N/4) = sqrt(P)*3/64
and we define the latency, L, of the network to be the round trip
time through the network plus memory chips. The network is larger than
the number of processors to provide adequate bisection bandwidth.
	
If each processor only issues one memory operation and then waits for
it to complete, the system could not scale since the performance would
degrade as the latency increased. In fact the processor must be able
to maintain L memory operations in concurrent execution where L grows
with sqrt(P).

One way to do this is to have a vector processor with vector length L.
Unfortunately this vector length is exposed to the programmer and
restricts the general applicability of the system.  An alternative is
to use a data-flow style processor, which has the disadvantage that
registers are hard to use effectively (though variants are currently
being studied).

A compromise is to have each processor support L independent
instruction streams. Each stream has its own program counter and
register set but all streams share the same set of functional units
and the same memory port via low-level time multiplexing. These
processors are generally referred to as ``multi-threaded'' processors.
They have the advantage of supporting general parallelism but also have
a register set to support compile-time optimizations.

In all of these architectures, not all of the parallelism in a program
is translated into ``speedup''. Rather some of it is ``spent'' to
support shared memory (and also functional unit pipeline latencies).
This parallelism is what Valiant refers to as ``parallel
slackness''. 

For this architecture, if we double the number of processors, then the
latency increases by no more than sqrt(2) (about 1.4) and so the
number of streams in an running program should increase by twice this
in order to realize a doubling in delivered performance.  For a
problem of fixed size, you will eventually run out of parallelism and
the only way to improve performance is to reduce memory latency but
this is merely Amdahl's Law.  To reduce memory latency for this
extreme case we can add local memory to each processor and expose its
presence to the programmer so he can leave the shared-memory world
when his problem is tight on parallelism and he absolutely must have
the extra performance. This machine is essentially the product being
developed by us at Tera Computer Co.

We can make this system less effective at providing shared-memory by
reducing the size of the network to NxNxK where K < N but leaving the
number of processors at NxN. As K decreases, the system moves toward
as 2D mesh-connected system which does not have the bandwidth to
support shared-memory and the programmer will have to use local memory
more aggressively to sustain the same level of performance.  Since the
network is a major cost in the system, we see that giving up shared
memory will reduce hardware cost.  Does it raise software costs by
more?

--
David Callahan  				       david@tera.com
Tera Computer Co. 400 North 34th Street, Suite 300  Seattle WA, 98103