Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!cs.utexas.edu!samsung!munnari.oz.au!sirius.ucs.adelaide.edu.au!fang!itd1!agq
From: agq@itd1.dsto.oz (Ashley Quick)
Newsgroups: comp.sys.apollo
Subject: Re: Apollo unkillable processes
Keywords: sigp blast kill
Message-ID: <1229@fang.dsto.oz>
Date: 24 Sep 90 19:14:12 GMT
References: <2414@dover.sps.mot.com>
Sender: news@fang.dsto.oz
Lines: 133

anderson@atc.sps.mot.com (howard anderson) writes:


>I really need help.  (Apollo Release 10.2 with patch m0118.)
>I am having difficulty killing certain processes.  Sometimes the processes
>can be killed easily.  Sometimes they can't.  Some random factor is
    [... things deleted]

>Now it looks to me like these are all Apollo routines and that the
>user tasks have all been eliminated.  Apollo response center people
>agreed that this was the case.  They said that their system routines

You have not said what your application program is. It looks a lot like
a print server to me. Knowing the application would help a lot.

>may be waiting for some resource that a third-party vendor didn't release.
>Since all user code AND the third party vendor code has been sucessfully
>blown away at this point it looks like we will be waiting here a long time.
>(The Apollo response center is closing my call.  They told me to contact
>the third-party vendor because it is obviously a problem in the third-party
>vendor code.)

Was the traceback taken AFTER you sigp'd the process? What flags
were used to sigp it?

>Questions I have for you are these:

>  1.  This situation runs counter to my philosophy regarding the
>role of an operating system.  The user task has been eliminated
>by the operating system.  So now we wait forever for an event
>that cannot happen?  I would not have expected the operating
>system to lose control in a case such as this.  Are my expectations
>too high??

DOMAIN/OS supports multiple threads (tasks) per process. The manuals make
it very clear that when this is used, special consideration needs to be
given to cleaning up the mess IF THE PROCESS IS ABORTED for any reason.
(This includes signalling it.)

It would appear that the application developer has not followed the
guidelines.

>  2.  Has anyone else seen processes that cannot be killed with
>a sigp -s??  Perhaps I am the only Apollo user with this problem.

Yes. If your application is a print server, and it is waiting to
output via an SIO line, you can sigp it and nothing will happen UNTIL
the sio line lets the task become active. While the OS is waiting
for a sio line, nothing can stop the process. This is the only time
I have seen this. (And I have no strong opinion on whether that is
acceptable or not.)

>  3.  Does anyone know a way to fake out the ec2_wait tasks and make
>them think they got what they are looking for?  How much damage do you
>think would result if one could do this?

No. It is perhaps possible IF you have the right knowledge and the
Apollo version of the /usr/include/apollo files, but that would only be
treating the symptoms, not fixing the problem.

>  4.  Does anyone know what blasting processes such as these actually
>does to the operating system??  The server_process_manager sometimes
>exits.  Is this a possible effect of blasting a process such as the
>above??

Blasting is NASTY. It is still recommended that you shut down the node
after blasting.

>  5.  When processes such as the above that are using sio lines are blasted,
>the sio lines are left "locked".  They cannot be unlocked since they are
>not really "locked objects".  I found that copying /dev/sio1 to /dev/siox
>then deleting /dev/sio1 then changing the name of /dev/siox to /dev/sio1
>will restore /dev/sio1 to service.

That is news to me. I just used to shut down the node to the phase 2
shell and re-start.

>  6.  If you are using DANFORD serial lines as well, they become "locked"
>in a similar manner.  Copying them and changing their names does not work.
>The ssiomonitor must be killed and restarted.  This means that all consoles
>served by the ssiomonitor must be shutdown and restarted in order to restart
>one line!

This is not surprising, as the manager probably works the same way as the
sio manager. You could also try signalling the ssiomonitor with a quit
fault, to get it to re-try. (The Apollo siomonit handles quit faults
specially, to get it to re-start the lines.)

>  7.  The group id of a forked child process is sometimes set to zero.
>This seems to occur randomly about 20 percent of the time.
>When the parent is killed, the child is not killed.  Has anyone
>else seen this problem?  (This seems to be unrelated to the unkillable
>process problem but perhaps in some way it IS related?)

Is this the same context as above? i.e., is your non-stoppable application
doing this? IF SO, a possible cause is forking a multiple-threaded
process. As multiple threads (tasks) are a bit nasty, anything could happen!


>PLEASE HELP

Comments:

I need more information to make a more informed guess! The traceback
you included indicates that the process is definitely running with
tasks. It looks like a print server, but that is just a guess. It could
also be a program which uses NCS services.

TASKs are nasty things. When tasking is enabled in a process, the entire
process must be very carefully written. Some system services behave
differently, and some (esp. UNIX calls) should be avoided as they are not
re-entrant. Further, cleanup handlers should be used to allow the
tasks to be shut down cleanly. If tasks are not shut down cleanly, all
sorts of weird things can happen. Also, signalling a process has
a different behaviour when tasking is enabled, and the handling is
more complex. (You have no idea which task is active at the time
a signal might be received, so the handling is messy!)
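
To make the cleanup point a bit more concrete, here is a rough sketch.
It is NOT written against the DOMAIN/OS tasking calls; it is only an
analogy using ordinary Python threads, and every name in it (worker,
run_and_stop, install_stop_handler) is made up for illustration. The idea
is that the signal handler does nothing except raise a flag, and each
task notices the flag and releases its own resources on the way out.

```python
import signal
import threading


def worker(name, stop, released, lock):
    """Hypothetical task: holds a 'resource' until asked to stop."""
    try:
        # Wake up regularly so a stop request is noticed promptly.
        while not stop.wait(timeout=0.05):
            pass  # real work would go here
    finally:
        # Cleanup handler: runs however the loop above is left.
        with lock:
            released.append(name)


def install_stop_handler(stop, signum=signal.SIGTERM):
    """Route the signal to one place: the handler only sets the flag.

    Must be called from the main thread (a Python restriction).
    """
    signal.signal(signum, lambda _signum, _frame: stop.set())


def run_and_stop(names):
    """Start one task per name, request a stop, wait for all cleanups."""
    stop = threading.Event()
    released = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(n, stop, released, lock))
        for n in names
    ]
    for t in threads:
        t.start()
    stop.set()  # stands in for the signal arriving
    for t in threads:
        t.join()
    return sorted(released)
```

In a real program you would call install_stop_handler(stop) once at
start-up instead of calling stop.set() directly. A task that is blocked
inside the OS and never gets back to checking the flag will still not
go away, which is the sio-line situation described earlier.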

If this is a third party product, I suggest you start writing to the
suppliers of it.

I have seen other postings which suggest re-booting the node every day,
etc. This should not be necessary, and we do not do that. The only time
I have had to do mass re-boots was during software development, when things
went wrong. DOMAIN/OS is pretty stable. However, BLASTing processes is
not recommended by Apollo, so you must accept the consequences...


Ashleigh Quick
AGQ@dstos3.dsto.oz.au