Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!columbia!rutgers!ucla-cs!ames!sdcsvax!darrell
From: ken@gvax.cs.cornell.edu (Ken Birman)
Newsgroups: comp.os.research
Subject: Re: How do you tell if a remote site is alive?
Message-ID: <3299@sdcsvax.UCSD.EDU>
Date: Wed, 10-Jun-87 15:17:23 EDT
Article-I.D.: sdcsvax.3299
Posted: Wed Jun 10 15:17:23 1987
Date-Received: Sat, 20-Jun-87 10:21:48 EDT
Sender: darrell@sdcsvax.UCSD.EDU
Organization: Cornell Univ. CS Dept, Ithaca NY
Lines: 22
Approved: mod-os@sdcsvax.uucp

In the ISIS system, we use a software protocol that triggers higher
level failure actions. The protocol is very fast and quite simple -- a
multiphase commit, basically. It gets triggered by timeouts, but the
idea is that when an overload occurs or something else causes an
incorrect timeout, the system shouldn't suddenly become inconsistent.

This is important because timeouts are really flakey in LAN systems (as
any SUN user will tell you). Thus, it is hard to build really robust
software on top of a purely timeout based scheme. Our approach has no
overhead at all except when a failure actually takes place and normally
kicks in within about 2 seconds of a crash (but this can be tuned
adaptively, say if overloads are common).

People may want to refer to a recent publication in the February issue
of TOCS for details, or send to me requesting reprints of this and other
ISIS related papers. The protocols described in that paper, including
the failure detector, are operational now at Cornell. We'll have
performance figures soon.

If you know us mostly from our old work on resilient objects, you may
want to look at this new material. It focuses on fault-tolerance and
reliability without transactions (I no longer believe in transactions)
and uses a toolkit approach. Much of the rest of the system is running
too, and we will have some nontrivial application software coming up
during the summer.

Ken Birman
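
To illustrate the general idea of a timeout that triggers an agreement
round rather than an immediate local decision, here is a minimal sketch.
It is not the ISIS protocol, whose details are in the TOCS paper; it is a
plain two-phase exchange under simplifying assumptions. The names
Member, prepare, commit and handle_timeout are hypothetical, and the
network is simulated by a set of reachable site names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Member:
    """One site in the group (hypothetical illustration, not ISIS code)."""
    name: str
    view: frozenset                      # sites this member believes are up
    view_id: int = 0
    pending: Optional[Tuple[int, frozenset]] = None  # proposal awaiting commit

    def prepare(self, view_id, proposed):
        # Phase 1: accept only a proposal newer than the installed view.
        if view_id <= self.view_id:
            return False
        self.pending = (view_id, proposed)
        return True

    def commit(self, view_id):
        # Phase 2: install the proposal that was prepared earlier.
        if self.pending and self.pending[0] == view_id:
            self.view_id, self.view = self.pending
            self.pending = None


def handle_timeout(coordinator, members, suspect, reachable):
    """Turn a timeout on `suspect` into an agreed view change.

    Returns True if every surviving member installed the new view,
    False if the round was aborted and all views were left unchanged.
    """
    proposed = coordinator.view - {suspect}
    new_id = coordinator.view_id + 1
    survivors = [m for m in members if m.name in proposed]

    # Phase 1: every survivor must be reachable and accept the proposal.
    acks = [m.name in reachable and m.prepare(new_id, proposed)
            for m in survivors]
    if not all(acks):
        for m in survivors:
            m.pending = None             # abort: nobody installs a partial view
        return False

    # Phase 2: all survivors install the same view.
    for m in survivors:
        m.commit(new_id)
    return True
```

In this sketch a timeout never changes any site's view by itself. Either
all surviving sites move to the same new view, or none do, so a spurious
timeout can wrongly exclude a site but cannot leave the survivors
disagreeing about who is in the group.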