Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!uunet!snorkelwacker!spdcc!dyer
From: dyer@spdcc.COM (Steve Dyer)
Newsgroups: comp.unix.wizards
Subject: Re: tip processes on the SUN
Message-ID: <1000@ursa-major.SPDCC.COM>
Date: 27 Dec 89 17:08:00 GMT
References: <21875@adm.BRL.MIL>
Reply-To: dyer@ursa-major.spdcc.COM (Steve Dyer)
Organization: S.P. Dyer Computer Consulting, Cambridge MA
Lines: 38

In article <21875@adm.BRL.MIL> swenson@nusc-wpn.arpa writes:
> We are using the standard tip line to a remote VME cage (i.e. just
>another machine). During (what appears to be) some relatively high bandwidth
>data transfers, the tip line loses its mind. Do a ps and the tip shows up as
><exiting>. Try to kill the process -- it won't die. During fastboot we
>get a message like "Warning processes wouldn't die -- suggest using ps"
>(we are truly afraid at this point). The questions are, why is the tip
>line hanging up (difficult to answer with limited information, I know),
>and is there a way to kill the process without rebooting the system?

Almost always, when a process is stuck in the <exiting> state, it's in
the middle of a device-specific close routine called from the exit code.
A process can invoke the exit code either explicitly through the exit
system call or in response to most signals which have SIG_DFL handling.
Here, the device-specific close routine would probably be for the
serial I/O hardware.

If the device-specific close routine (or a routine it calls) sleeps,
and for one reason or another there is no wakeup() forthcoming, you
will get into this kind of situation. Usually, the close routine in
TTY drivers attempts to flush the characters on the output clist to the
hardware before returning from the close. Now, with hardware problems
or bugs in the driver itself, if the output interrupt never happens or
it doesn't manage to issue a wakeup, the process will be hung up on a
sleep() inside the exit code.

You can issue a "kill" as much as you want.
What it will do each time, however, is to interrupt the sleep and
restart the exit code. The exit code will loop through all open files
and call the device-specific close routine again and get stuck one more
time. Without rewriting the device driver to handle this pathological
situation (or ingenious adb hacking on an active kernel), the easiest
way to recover from this is to reboot.

This is a general description of what can go wrong--it isn't Sun-specific.

--
Steve Dyer
dyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer
dyer@arktouros.mit.edu, dyer@hstbme.mit.edu
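
The loop described in the post above can be sketched as a toy user-space
model in C. This is only an illustration, not kernel or driver code:
toy_file, toy_close, toy_exit and stuck_attempts are hypothetical names
invented here, and the "sleep" is reduced to a return value. It assumes,
as the post describes, that each signal interrupts the sleep and restarts
the exit code from the top.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only. Every name below is hypothetical. */
struct toy_file {
    bool open;            /* still in the process's open file table */
    int  pending_output;  /* characters still queued for the device */
    bool device_drains;   /* false models a lost output interrupt */
};

enum close_result { CLOSED, INTERRUPTED };

/*
 * Models a close routine that waits for queued output to drain.
 * If the device never drains, the wait can only end when a signal
 * interrupts it, which is reported here as INTERRUPTED.
 */
static enum close_result toy_close(struct toy_file *f)
{
    assert(f->open);
    if (f->pending_output > 0) {
        if (!f->device_drains)
            return INTERRUPTED;   /* no wakeup is ever coming */
        f->pending_output = 0;    /* output flushed to the hardware */
    }
    f->open = false;
    return CLOSED;
}

/*
 * Models the exit code: walk the open files and close each one.
 * Returns true if every file was closed, false if a close got stuck.
 */
static bool toy_exit(struct toy_file files[], int n)
{
    for (int i = 0; i < n; i++) {
        if (files[i].open && toy_close(&files[i]) == INTERRUPTED)
            return false;
    }
    return true;
}

/*
 * Delivers up to `kills` signals, each one restarting the exit code.
 * Returns how many of those attempts ended stuck in a close.
 */
static int stuck_attempts(struct toy_file files[], int n, int kills)
{
    int stuck = 0;
    for (int k = 0; k < kills; k++) {
        if (toy_exit(files, n))
            break;
        stuck++;
    }
    return stuck;
}
```

With one file whose device never drains, every attempt in this model
stops at that file, and files after it in the table are never reached.
Once device_drains is set to true (standing in for a fixed driver or
working hardware), the next attempt completes.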