Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!wuarchive!cs.utexas.edu!rice!uw-beaver!uw-june!rik
From: rik@cs.washington.edu (Rik Littlefield)
Newsgroups: comp.arch
Subject: Re: Shared Memory vs. Distributed Systems
Keywords: Shared, distributed, message
Message-ID: <9605@june.cs.washington.edu>
Date: 26 Oct 89 18:46:33 GMT
References: <20764@usc.edu> <1646@ncrcce.StPaul.NCR.COM>
Organization: U of Washington, Computer Science, Seattle
Lines: 57

In article <1646@ncrcce.StPaul.NCR.COM>, pasek@ncrcce.StPaul.NCR.COM 
(Michael A. Pasek) writes:  [about the transputer]

> Although you describe this as "message passing" between the computers,
> it sounds to me like you have one "global" memory which is controlled by the
> "dedicated DMA machine", and this memory is "shared" by all the "local"
> processors.  What difference does it make whether you set some register in
> your "dedicated DMA machine" and tell it to move some data to another 
> location, or just set an address latch in your micro somewhere and do a
> "store" instruction ?  
> 

It matters a great deal, both in raw performance and in how you program
the beasts.

On a transputer, 10 microseconds per hop is what, 30 instructions?  Say
you're in a tiny network, averaging 3 hops each way, out and back.  That's
6 hops, or 180 instructions of latency.  We can quibble about the numbers, but the point is,
it's a long time.  What this means is that it's dangerous to think "when I
need this datum, I'll just go get it".  That programming model works OK on
the "shared memory" machines, which have much lower latency.  On a
transputer array, it gives out quickly as you try programs that do
progressively more sharing.  The same thing happens on any other existing
distributed memory machine.  Of course, as several posters have pointed out,
if each node has enough things to do, you may be able to hide the latency.
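
To make the arithmetic above concrete, here is a rough sketch in Python.
The 30 instructions per hop and 3 hops each way are the guesses from the
paragraph above; the function names, and the local_work and remote_fetches
figures in the usage note, are made up purely for illustration.

```python
# Back-of-envelope latency estimate. The two constants are the rough
# guesses from the text, not measured values.
INSTRUCTIONS_PER_HOP = 30   # roughly 10 microseconds per hop
HOPS_EACH_WAY = 3           # average distance in a "tiny network"


def round_trip_instructions(hops_each_way=HOPS_EACH_WAY,
                            instructions_per_hop=INSTRUCTIONS_PER_HOP):
    """Instructions' worth of time spent waiting on one remote fetch."""
    return 2 * hops_each_way * instructions_per_hop


def slowdown(local_work, remote_fetches):
    """Total time divided by useful time, if every fetch blocks.

    local_work and remote_fetches are hypothetical inputs: instructions
    of real computation, and the number of blocking remote fetches.
    """
    wait = remote_fetches * round_trip_instructions()
    return (local_work + wait) / local_work
```

With these guesses, a node that does 1800 instructions of useful work and
blocks on 10 remote fetches spends as long waiting as computing, so
slowdown(1800, 10) comes out to 2.0.  More sharing makes it worse, linearly.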

Perhaps a more fundamental difference is that shared memory machines
provide hardware to guarantee that all processors have a consistent view of
memory.  Cooperation between processors can be controlled just by
synchronizing.  There's no need to explicitly update other processors'
copies of shared data.  With the distributed memory machines, no processor
will find out about an update until it asks or another one tells it.  The
programmer can be insulated from this necessity by the compiler and/or
runtime system, as with Kai Li's shared virtual memory (why isn't it
"virtual shared memory"?), but the performance penalty can be severe if
you're unlucky about the sharing patterns.
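
A toy model of the "until it asks or another one tells it" point, in
Python.  This is only a sketch: the Node class and its methods are invented
for illustration and do not correspond to any real machine or to Li's
system.  Each node keeps a private copy, and a write on one node stays
invisible to the other until an update message is both sent and processed.

```python
class Node:
    """Hypothetical distributed-memory node holding a private copy of one datum."""

    def __init__(self, value):
        self.copy = value   # this node's local copy of the shared datum
        self.inbox = []     # update messages that have arrived but not been applied

    def write(self, value):
        # A local store changes only this node's copy.
        self.copy = value

    def send_update(self, other):
        # Explicitly tell another node about the current value.
        other.inbox.append(self.copy)

    def receive(self):
        # Apply pending updates in arrival order.
        while self.inbox:
            self.copy = self.inbox.pop(0)


a, b = Node(0), Node(0)
a.write(5)
stale = b.copy          # b has not been told yet
a.send_update(b)
b.receive()
fresh = b.copy          # now b has caught up
```

On a shared memory machine the hardware does the equivalent of send_update
and receive for you; here the program (or the compiler and runtime) has to.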

It looks to me like the distributed memory machines are pretty good at
running three kinds of programs.  First are those that just don't require
much write-sharing.  Second are those with communication patterns that can
be mapped to nearest-neighbor links, assuming that you're lucky enough to
have a low-overhead machine like the transputer.  Third are those in which
the latency can be hidden and overhead amortized by handling lots of
independent store/fetch requests at once.  (Combinations of the above are
best of all ;-)
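
The third case can be put as a simple cost model, sketched here in Python.
Only the 180-instruction round trip is taken from the estimate earlier in
this post; the per-message overhead and per-item cost are numbers I made up
for illustration, and the model assumes the requests really are independent,
so they can all be in flight together.

```python
def blocking_cost(n_requests, round_trip, per_message_overhead):
    """One request at a time: pay full latency and overhead for each."""
    return n_requests * (round_trip + per_message_overhead)


def batched_cost(n_requests, round_trip, per_message_overhead, per_item_cost):
    """All requests in one batch: latency and overhead are paid once,
    plus a small marginal cost per item carried in the batch."""
    return round_trip + per_message_overhead + n_requests * per_item_cost


# Hypothetical figures, in units of instruction times.
ROUND_TRIP = 180   # from the 6-hop estimate
OVERHEAD = 50      # assumed software cost to set up one message
PER_ITEM = 2       # assumed marginal cost of one more item in a batch

one_at_a_time = blocking_cost(100, ROUND_TRIP, OVERHEAD)
all_at_once = batched_cost(100, ROUND_TRIP, OVERHEAD, PER_ITEM)
```

With these made-up numbers, 100 blocking requests cost 23000 instruction
times and the batched version costs 430.  The exact ratio means nothing, but
the shape of the argument holds as long as the requests do not depend on
each other.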

Using that third approach, there seem to be quite a few applications that
can run well even with high communications overhead.  (See Fox's book,
"Solving Problems on Concurrent Processors".)  But lack of software support
is a big problem.  Programs like that can be written using explicit message
passing, but it's not easy.  Several people, including myself, are working
on compilation and runtime techniques to let you write programs using a
shared memory data model, but have them execute efficiently on a message
passer.  It's not an easy problem (Ph.D. thesis material), but there are
several promising approaches.  Stay tuned, but don't hold your breath.

--Rik