Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!rpi!batcomputer!cornell!ken
From: ken@cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: More on use of kill() to detect process failures
Message-ID: <1991Jun25.120346.21111@cs.cornell.edu>
Date: 25 Jun 91 12:03:46 GMT
Sender: news@cs.cornell.edu (USENET news user)
Organization: Cornell Univ. CS Dept, Ithaca NY 14853
Lines: 33
Originator: ken@turing3.cs.cornell.edu
Nntp-Posting-Host: turing3.cs.cornell.edu

A few weeks ago I suggested a simple loop for detecting process failures
using the kill() system call, as a hack for a situation where SunOS might
fail to report a broken pipe. Some new insight on this:

1) The broken-pipe business is much less common than I appreciated. If
you see ISIS systematically fail to detect process termination in some
situation, perhaps the process isn't really exiting, or perhaps someone
forked a child and didn't close isis_socket, leaving a dup around that
fools bin/protos into not noticing when the child exits. isis_disconnect()
closes both isis_socket and intercl_socket, so the child process should
call it after a fork/vfork.

I tried to write a test program to demonstrate the missed broken-pipe
report and did see it, perhaps once in a hundred runs, but it clearly
depends on something uncommon happening just as the pipe breaks (paging
activity, perhaps). Usually, SunOS detects the condition perfectly.

2) The kill() solution works quite well, but programs probed this way
can't be run under dbx/gdb, since you get a debugger trap every few
seconds. That is a strong disadvantage in my opinion, so my plan is to
make this a compile-time option to protos, disabled by default.

Ken
-- 
Kenneth P. Birman                              E-mail:  ken@cs.cornell.edu
4105 Upson Hall, Dept. of Computer Science     TEL:     607 255-9199 (office)
Cornell University Ithaca, NY 14853 (USA)      FAX:     607 255-4428