Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!zaphod.mps.ohio-state.edu!van-bc!ubc-cs!uw-beaver!cornell!ken
From: ken@gvax.cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: Fast causal multicast
Message-ID: <48345@cornell.UUCP>
Date: 14 Nov 90 14:51:40 GMT
Sender: nobody@cornell.UUCP
Reply-To: ken@cs.cornell.edu (Ken Birman)
Distribution: comp
Organization: Cornell Univ. CS Dept, Ithaca NY
Lines: 163

I received the following inquiry and thought it might interest
the readers of this group.  comp.sys.isis is unmoderated and this
topic is debatable...  Feel free to post counter-arguments. 

> From scott@amethyst.omg.ORG Tue Nov 13 15:16:30 1990
> To: ken@gvax.cs.cornell.edu
> Subject: Fast Causal Multicast - multiple overlapping process groups

> Ken:

> Don't know if you'll remember me (Scott Meyer).  I attended Fingerlakes '89.
> At the time your description of bypass CBCAST piqued my interest.  The
> life of an engineer being what it is, I haven't had the chance to read
> your "Fast Causal Multicast" paper until now.  I do have some comments to
> which I'd be interested in hearing your response.

> Early on, you state: "Particularly valuable to us has been the ability
> to support multiple, possibly overlapping process groups..."  Could
> you provide examples (or pointers to examples) that substantiate this?

> In particular, I am skeptical of the need to enforce causal ordering
> across overlapping process groups as not all applications may need it.
> It seems to me that making this assumption engenders a disproportionate
> amount of complexity - runtime graph analysis, while not making it
> does not prevent applications that need such functionality from
> obtaining it.  To wit: suppose the membership of the groups g1 and g2
> overlaps.  Suppose futher that there is a set of messages, M, that
> we wish to deliver to both g1 and g2 with a causal ordering.  We can
> achieve this by creating a new group g3 = g1 U g2 and delivering
> all messages in M to g3.  This method does require distributed system
> developers to have a clear understanding of causal relationships that
> your bypass CBCAST protocol might save them the work of making explicit.
> The question in my mind is: "Is providing that extra service worthwhile
> given the extra (substantial) complexity that implementing it requires?
> If yes, why?"  Obviously, my own opinion is that it is not worthwhile
> to implement inter-group causal message ordering, and by extension that
> distributed system designers will have to have a good understanding of
> message causality (and a variety of other issues) in order to successfully
> implement a distributed system.

> Thanks in advance for your response.  Regards,

> Scott Meyer
> Netwise Inc.
> 2477 55th St.
> Boulder, CO  80301
> tel 303/442-8280

Scott,

We actually agree with your first point, but we also see a convincing
counter example.  As for the business about the need to maintain and
reason about the communication graph, I think you misunderstood how that
mechanism works (we reason about the graph, but we don't need to maintain
it anywhere).  

Issue 1: causality over group boundaries.

Let me grant that if two groups are not supposed to "interfere" then enforcing
causality could create an undesired effect.

But, say that I build an application in the style that the LINDA people
advocate.

I have a computational thread that writes some parameters on a blackboard
(some stuff that you might not need to see, but if you do, it costs a lot
to get it wrong).  Then, it puts a bunch of work requests into a bag
of work requests.

Server programs pull out work requests and start computing.  They can check
the blackboard if desired.

Now, I would argue that:
 1) You need to do the black board update asynchronously (if it is replicated)
    because otherwise the delay associated with waiting for the update to get
    done might be noticable and could even be a big problem (imagine a 1 BIP
    workstation in 1998 running over a comparatively slow link to an equally
    fast machine down the hall...  )  So, the bboard update is asynchronous,
    and the task bag update now creates a race condition.

    Specifically, will the server see the bboard update when it looks?
    With a causal guarantee, it will.

 2) This is not an uncommon paradigm.  In fact, with OO programming becoming
    common, this may actually be the usual style of distributed computing in
    the future.  I have lots of other examples of this sort (you register an
    activity and then someone who sees it goes off and thinks, then sends in
    an "answer" that races the original registration to some third party).

 3) Fault-tolerance can create more extreme problems, where "holes in the past"
    become a real difficulty.  Here, the issue is as much due to replication as
    asynchrony: I may have seen the state of some replica which then crashed,
    as did the original sender.  Now I am talking to some other replica in a
    state that "observed" a "past" update... will that replica ever see the
    update?  Will it see it before it hears from me?

The RPC community has run into this too: the Mercury project at MIT raised
many similar questions...

So do database systems.  Say that I do an update, then a commit.  The commit
releases a lock and you acquire it and do a read.  Do you see my update?
Well, if we run synchronously, sure you do.  But, if we run asynchronously
(for a big speedup) the exact same race arises.  And, again, it is easy to
believe that the process group used for locking is separate from the one used
for the replicated data item itself.  So, we definitely see a cross-group
causality issue here, too.

Robert and I are writing a paper on precisely this -- a paper on redoing
ISIS that asks what the appropriate rules for process groups ought to be.
Our paper argues for causality classes: causality would be enforced within
multiple groups in a single class, but might be violated over class boundaries.
The idea is that most groups, by default, would be in some single class.
But, if you build a system with a realtime subcomponent, or some other
independent mechanism in it, the group(s) associated with that mechanism
would be explicitly placed in a different class and hence kept separate from
the ones in the base class.

There are a number of additional issues that we are also looking at hard at, too: 
  1) Why have the system manage groups at all?  Why not put this at the
     application level?
  2) Given groups, why not do multicast using parallel RPC?  (Or, perhaps,
     transactions, if you want atomicity)?
  3) what do groups usually "look like" and how big are they?
  4) How much synchronization is needed between group membership and multicast
  5) Causality: see above
  6) Do we need the other ISIS protocols (ABCAST, etc, etc, etc, etc)
  7) Is there a simple, clean architecture for all this?

We'll make copies of this available when we finish the paper.  

Issue 2: reasoning about the communication graph

Next, regarding your question about the complexity of the graph needed in
the multicast paper (now being revised for ACM/TOCS).  We had a nice 
idea on this issue.  We realized (and subsequently learned of a brief
paper on this by Prof. Singhal -- U. Minn.) that if A -> B and A carries
a field VTg then if VTg hasn't changed, B can omit VTg.  This makes it
possible to just carry all VT's for all groups around on all messages, but to
actually transmit much smaller amounts of information.  So, we are revising
the paper to move the combined VT/LT stuff into a "special issues" section,
and to focus on the VT scheme with this optimization as our main protocol.

In particular, this change essentially eliminates the need to do any sort of
runtime estimation of the CG graph.  Instead, you just piggyback VT's
all around the system, but usually don't turn out to send very many of them.
This protocol is also extremely simple, and I wonder if it doesn't push the
balance back in the other direction?

I should also note that in the combined VT/LT scheme, we don't maintain the
CG graph explicitly.

The current TR version of the paper includes schemes for deducing some simple 
information about the graph based on information directly accessible by
looking at groups to which you belong, and being told about the groups to
which other members belong.  We never actually go out and build the graph
or explicitly test it for cycles.  Thus, even if you do use the VT/LT scheme,
which remains interesting because it guarantees that you don't need to
piggyback certain timestamps out, you certainly really don't need any
information that wouldn't already be available locally -- and in particular,
nobody actually maintains a structure called the CG graph.  The idea is
more like proving that a system can't deadlock by reasoning about a wait-for
graph: even without building the graph, you may be able to say something
about how it would look if you did go ahead and build it.