Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!julius.cs.uiuc.edu!ux1.cso.uiuc.edu!mp.cs.niu.edu!rickert
From: rickert@mp.cs.niu.edu (Neil Rickert)
Newsgroups: comp.unix.misc
Subject: Re: killing a process gone bad.
Message-ID: <1990Nov1.175243.20810@mp.cs.niu.edu>
Date: 1 Nov 90 17:52:43 GMT
References: <1990Oct25.185822.11838@nntp-server.caltech.edu> <1990Oct30.032707.1222@brian386.uucp> <1119@massey.ac.nz>
Organization: Northern Illinois University
Lines: 89

In article <1119@massey.ac.nz> GEustace@massey.ac.nz (Glen Eustace) writes:
>We recently had the exact situation described in the previous
>posting.  There was a little more code involved but the net effect
>was the same.  All attempts to clear out the system failed as there
>was no spare CPU available to allow remedial action to be taken.  The
>problem was cured by a reboot.
>
>Following our problem, the perpetrator posted to comp.unix.questions
>to find out what we could have done.  We received various replies
>including the 'kill -9 -1' variety.
> We have 10 processors.

 Simple killing of replicating processes never works, because more are
created as fast as old ones are killed off.

 I regularly see students who inadvertently create the problem, and finish
up running out of processes (the local per-user limit is 50).  I have NEVER
had to reboot to resolve this problem.  My experience is with a BSD system,
so it may not apply to SysV.

 Here are three simple approaches to try:

 (1) The simple-minded approach.

  Look for a file which the programs depend on.  Try removing or renaming
  that file.  In particular, if the replicating process seems to be a shell
  script, look for a shell script in the user's directory named 'test'.

 (2) The slow and tedious method.

  This is a method I sometimes ask the student and/or his instructor to
  use.  It is somewhat slow, as it requires killing all the processes
  individually.  It usually works.

  Step 1.  Find a list of the bad processes.
  If the student is doing this himself, he can ask a friend on a
  different account to do a 'ps uax|grep user' for this purpose.  Failing
  that, he should be able to log in, and then use 'exec ps ug'.  This will
  give the list of processes, but log him out again.

  Step 2.  Armed with a list of process IDs, start killing them with the
  STOP signal.

        exec /bin/kill -STOP pid pid pid ...

  The idea is to prevent further replication, but keep the processes in
  place so that you are always at the limit.  This step and Step 1 may
  have to be repeated several times to stop them all.

  Step 3.  Start killing the STOPPED processes.

  To do this you will need the output of 'ps l'.  You must not kill a
  child before killing its parent.  Killing the child may cause the parent
  to wake up, and go back to its errant ways of replicating itself.  Most
  of the time when you see this, some of the processes have process 1 as
  the parent ID.  The procedure is to kill all of the errant processes
  whose PPID is 1.  Keep repeating this step till they are all gone.
  Usually this becomes easier as you proceed, for you stop getting the
  'out of processes' message after killing a few, and no longer need to
  'exec /bin/kill' and log in again after every try.

 (3) The brute force method.

  I posted a script to do this recently.  It was posted as article
  <1990Oct26.140851.11707@mp.cs.niu.edu>.  Read that article for full
  information.  It requires that you be root to execute it, and it
  requires that the perpetrator's login shell be 'csh' (because 'kill' is
  then builtin and doesn't require a new process).

  The basic idea is 'blocking'.  You keep the number of processes at the
  limit, so as to prevent further replication.  The script does the
  following:

        for each errant process
            create a new process (/bin/csh) for the user.
            kill the errant process
            the new process exec's to 'sleep 10 minutes' so as to be
              relatively harmless.

  If the processes are dying as well as replicating, my script may need
  to be rerun a few times.
  But, regardless, it soon creates enough sleeps under the userid that
  further replication of all errant processes is impossible, so they
  either all die out naturally, or sit around long enough to be killed.

 I have thought of rewriting the script as a C program.  It would be SUID,
so that anyone could use it.  Basically it would allow a user to type
'exec superkill' to kill all of his processes.  I have never bothered to do
this because the problem does not seem to crop up often enough to go to the
trouble.

--
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
  Neil W. Rickert, Computer Science
  Northern Illinois Univ.
  DeKalb, IL 60115.                                  +1-815-753-6940
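
Addendum to method (2): the ordering in Steps 2 and 3 (STOP everything first, then kill parents before children) can be sketched roughly as below. This is only an illustration, not the script from the article cited above. It is written in Python, which the post does not use, and the names kill_rounds and stop_then_kill are hypothetical. It assumes the (pid, ppid) pairs have already been copied by hand from 'ps l', and that whoever runs it can still start a process and has permission to signal the targets.

```python
import os
import signal


def kill_rounds(procs):
    """Group (pid, ppid) pairs into rounds so parents come before children.

    Round 0 holds the errant processes whose parent is not itself errant
    (typically PPID 1).  Each later round holds the children of the round
    before it, mirroring "kill those with PPID 1, then repeat".
    """
    parent_of = dict(procs)
    errant = set(parent_of)
    rounds = []
    done = set()
    current = sorted(p for p, pp in parent_of.items() if pp not in errant)
    while current:
        rounds.append(current)
        done.update(current)
        current = sorted(
            p for p, pp in parent_of.items()
            if p not in done and pp in rounds[-1]
        )
    return rounds


def _signal_quietly(pid, sig):
    """Send sig to pid, ignoring processes that have already exited."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        pass


def stop_then_kill(procs):
    """Step 2 then Step 3: freeze every process, then kill round by round."""
    for pid, _ppid in procs:
        _signal_quietly(pid, signal.SIGSTOP)   # prevent further replication
    for batch in kill_rounds(procs):
        for pid in batch:
            _signal_quietly(pid, signal.SIGKILL)
```

Calling kill_rounds on its own is harmless and shows the order that stop_then_kill would follow. Because new processes may appear between the 'ps' listing and the STOP pass, the whole thing may need repeating, just as with the manual procedure.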