Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!sri-spam!ames!sdcsvax!darrell
From: heddaya@harvard.harvard.edu (Abdelsalam Heddaya--aka Solom)
Newsgroups: comp.os.research
Subject: Re: How do you tell if a remote site is alive?
Message-ID: <3300@sdcsvax.UCSD.EDU>
Date: Thu, 11-Jun-87 12:59:32 EDT
Article-I.D.: sdcsvax.3300
Posted: Thu Jun 11 12:59:32 1987
Date-Received: Sat, 20-Jun-87 11:41:45 EDT
Sender: darrell@sdcsvax.UCSD.EDU
Lines: 44
Approved: mod-os@sdcsvax.uucp
In-reply-to: darrell@sdcsvax.UCSD.EDU's message of 9 Jun 87 06:35:08 GMT

The question of detecting failures in a distributed system is a very
tricky one.  The standard method of sending a "ping" message to the
machine in question and timing out on the "pong" (or ack) suffers from
the following two problems:

1. The machine may simply be slow in responding, and so ends up being
   assumed dead by some machines on the network and alive by others.
   Worse still, it doesn't know that it is assumed dead and may
   happily proceed to destroy the consistency of shared data.  One
   (unsatisfactory) solution is for the other machines to force the
   slow one to fail, aborting all the updates it may have done while
   in the twilight zone.  This problem may go away if the ping-pong
   messages are handled by dedicated hardware, which, if running, has
   constant speed.

2. Distinguishing between link failures and processor failures is
   hard, especially if the link failure is such that only some
   messages are dropped or delayed.  A timeout can indicate a message
   loss or delay, a link failure, or a processor failure.  Again, if
   we are to successfully distinguish these failure modes, special
   low-level network support will be needed.  In particular, the
   network must guarantee that if a path exists, the message will be
   delivered within some bounded time delay.  A weaker version of this
   requirement, which turns a message delay into a message loss, is
   easily achievable.
   The network can guarantee that a message is either delivered within
   a bounded time, or not at all, by having the receiver check the
   sender timestamp on the message and dropping it if it is too far in
   the past.

For these two reasons, many of the protocols in the literature do not
require immediate or accurate failure detection (e.g. the various
flavors of voting protocols).

[ Voting protocols allow some flexibility by only requiring a majority of ]
[ sites.  The problem here is still "How long do you wait?"  --DL         ]

					---Abdelsalam Heddaya

	heddaya@harvard.harvard.edu
	heddaya@harvard.cs.net
	heddaya@harvunxh.bitnet
	heddaya@harvard.uucp
	{rutgers, topaz, ihnp4, allegra, seismo, ...}!harvard!heddaya

	Aiken Lab, 33 Oxford St., Harvard U., Cambridge, MA 02138.
	(617) 495-3998
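
The receiver-side timestamp check described under point 2 could be
sketched roughly as follows, in Python.  This is only an illustration
under stated assumptions: the sender and receiver clocks are taken to
be roughly synchronized, and the names Message, MAX_DELAY, and
deliver_or_drop are hypothetical, not taken from any particular system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    sender: str
    sent_at: float  # sender's clock reading when the message left, in seconds
    payload: str


# Hypothetical delivery bound, in seconds (an assumption for illustration).
MAX_DELAY = 2.0


def deliver_or_drop(msg: Message, now: float,
                    max_delay: float = MAX_DELAY) -> Optional[Message]:
    """Return msg if it arrived within max_delay of being sent, else None.

    Dropping a late message turns a message delay into a message loss.
    """
    age = now - msg.sent_at
    if age > max_delay:
        return None  # too far in the past: treat as if it never arrived
    return msg


fresh = Message("a", 100.0, "update x")
late = Message("b", 95.0, "update y")

print(deliver_or_drop(fresh, now=101.0))  # delivered, 1 second old
print(deliver_or_drop(late, now=101.0))   # dropped, 6 seconds old
```

Note that the check is only as good as the clocks: if the receiver's
clock runs ahead of the sender's by more than the bound, even promptly
delivered messages look stale and are dropped.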