Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!columbia!rutgers!ucla-cs!ames!sdcsvax!darrell
From: ken@gvax.cs.cornell.edu (Ken Birman)
Newsgroups: comp.os.research
Subject: Re: How do you tell if a remote site is alive?
Message-ID: <3299@sdcsvax.UCSD.EDU>
Date: Wed, 10-Jun-87 15:17:23 EDT
Article-I.D.: sdcsvax.3299
Posted: Wed Jun 10 15:17:23 1987
Date-Received: Sat, 20-Jun-87 10:21:48 EDT
Sender: darrell@sdcsvax.UCSD.EDU
Organization: Cornell Univ. CS Dept, Ithaca NY
Lines: 22
Approved: mod-os@sdcsvax.uucp

In the ISIS system, we use a software protocol that triggers higher-level
failure actions.  The protocol is very fast and quite simple: basically a
multiphase commit.  It is triggered by timeouts, but the idea is that when
an overload or something else causes a spurious timeout, the system
shouldn't suddenly become inconsistent.  This is important because
timeouts are really flaky in LAN systems (as any Sun user will tell you).
Thus, it is hard to build really robust software on top of a purely
timeout-based scheme.  Our approach has no overhead at all except when a
failure actually takes place, and it normally kicks in within about 2
seconds of a crash (but this can be tuned adaptively, say if overloads are
common).
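
As a rough illustration of the idea (this is not the ISIS code, and the
real protocol is the one in the paper), here is a minimal Python sketch in
which a timeout only raises a suspicion, and a two-phase agreement then
installs the same new membership view at every surviving site.  The names
Member, prepare, commit and handle_suspicion, and the majority rule, are
all assumptions made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    view: tuple              # installed membership view (site names)
    view_id: int = 0
    pending: tuple = None    # (view_id, proposed view) awaiting commit
    crashed: bool = False

    def prepare(self, view_id, proposed):
        # Phase 1: acknowledge a proposal only if it is newer than our view.
        if self.crashed or view_id <= self.view_id:
            return False
        self.pending = (view_id, proposed)
        return True

    def commit(self, view_id):
        # Phase 2: install the proposal acknowledged in phase 1.
        if self.crashed or self.pending is None or self.pending[0] != view_id:
            return
        self.view_id, self.view = self.pending
        self.pending = None


def handle_suspicion(coordinator, members, suspect):
    """Run when a timeout makes `coordinator` suspect `suspect`.

    `members` maps site names to Member objects (hypothetical stand-in
    for real message passing).  Returns True if a new view was installed.
    """
    proposed = tuple(n for n in coordinator.view if n != suspect)
    new_id = coordinator.view_id + 1
    survivors = [members[n] for n in proposed]
    acks = [m for m in survivors if m.prepare(new_id, proposed)]
    # Assumed rule: need a majority of the old view, so that two sides of
    # a partition cannot both install conflicting views.
    if len(acks) * 2 <= len(coordinator.view):
        return False
    for m in acks:
        m.commit(new_id)
    return True
```

The point of the sketch is that a spurious timeout still leads every
remaining site to the same view: the suspected site is excluded
consistently (and would have to rejoin), rather than some sites treating
it as dead and others as alive.  No messages are exchanged until a
suspicion is raised, which matches the "no overhead except on failure"
property in spirit only.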

People may want to refer to a recent paper in the February issue of
TOCS for details, or write to me to request reprints of this and other
ISIS-related papers.  The protocols described in that paper, including the failure
detector, are operational now at Cornell.  We'll have performance figures
soon.  If you know us mostly from our old work on resilient objects, you
may want to look at this new material.  It focuses on fault-tolerance and
reliability without transactions (I no longer believe in transactions) and
uses a toolkit approach.  Much of the rest of the system is running too, and
we will have some nontrivial application software coming up during the summer.

Ken Birman