Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!munnari.oz.au!sirius.ucs.adelaide.edu.au!fang!itd0!agq
From: agq@itd0.dsto.oz (Ashley Quick)
Newsgroups: comp.sys.apollo
Subject: Re: create remote process
Message-ID: <1174@fang.dsto.oz>
Date: 10 Aug 90 01:34:54 GMT
Sender: news@fang.dsto.oz
Reply-To: agq@dstos3.dsto.oz (Ashleigh Quick)
Organization: Defence Science and Technology Organisation
Lines: 156


My previous posting mentioned a big BIG problem.

I have exchanged some E-Mail with others about this, and am posting to
get wider coverage.

We have a node here running SR10.1. When we run three print servers,
strange things begin to happen... like "prf -list_pr" will fail with a
message "unable to locate printers for site xxxxx - unable to bind
socket". When it is sick the /etc/ncs/lb_admin utility will not talk
to the local location broker (llbd), you cannot CRP off or on the
node, etc.

Killing one of the print servers makes things a little better. Then
you may be able to CRP onto the node once or twice... after that CRP
will just die (if coming in from elsewhere), and trying to CRP out of
the sick node bombs with a similar error to that above.


Here is an edited version of what I have sent to Dave Krowitz, which
contains an edited version of some of his earlier suggestions....


Msg> Recently you mailed me with some info about our weird and wonderful
Msg> problem of 'no more sockets' (also known as 'can't bind socket').
Msg>
Msg> You sent:
Msg>
Msg> >
Msg> >My guess is that the problem is not with your pty's. NCS is a method of
Msg>         [... more on ptys]
Msg> >
Msg> >If /etc/ping, ftp, telnet, rlogin, etc. work between the nodes in question,
Msg> >then your TCP services are probably OK. /etc/ping will tell you that *some*
Msg>       [etc]
Msg>
Msg> TCP services are working OK. We run tcpd and inetd on every node in our
Msg> network. One central group do the administration, OS build/install, etc.
Msg> They are as fooled by this problem as anybody.
Msg>
Msg> >If your TCP services seem ok, then start checking your llbd's on the nodes
Msg> >in question, and the glbd's on all nodes in your network which run the global
Msg> >broker. /etc/ncs/llbd_admin and /etc/ncs/drm_admin are the tools to use for
Msg> >this. drm_admin will tell you if the global databases are out of synch and
Msg> >if the clocks on the nodes are different. Run it on each node in question
Msg> >and see that the list of glbd sites is the same on each node! (some nodes
Msg> >may only see a subset of all the glbd's that are supposed to be running).
Msg> >
Msg>
Msg> OK.
Msg> On our net we have 3 glbd's running. I have checked them. They all
Msg> know about each other, on the right nodes. The clocks are in sync to
Msg> within about 30 seconds. [Our sys admin people complained bitterly
Msg> about the crummy hardware which lets the clocks slip - when system
Msg> software depends on them being accurate.]
Msg>
Msg> I ran /etc/ncs/lb_admin on each of the nodes, and cleaned up the glb
Msg> and llb data bases. (Some of which did contain old/inaccurate
Msg> garbage.)
Msg>
Msg> The problem, after all of this, has not gone away. It only seems to
Msg> happen [be most apparent] when I have 3 prsvrs running.
Msg>
Msg> To recap: A DN4500, running SR10.1. We run three print servers: one is
Msg> a LaserJet with my own driver [which is incomplete - but works well
Msg> enough], another is a line printer [via a National Instruments GPIB
Msg> port!!!!], and the third is an HP7550 plotter. These give service for
Msg> a number of applications, including Mentor Graphics, and simulation
Msg> tools from Eesof. This node also runs the print manager (for this
Msg> "site"), as well as tcpd, inetd, llbd, glbd, spm, etc....
Msg>
Msg> The problem is not always apparent. When it is there, I have noticed
Msg> the following:
Msg>
Msg>     prf -list_printers
Msg>          This will list each of the print manager sites in the
Msg>          network, with a message saying something like 'unable to
Msg>          locate printers for site xxxxxx - unable to bind socket'. (Or
Msg>          was it '... - no more free sockets'?)
Msg>
Msg>     /etc/ncs/lb_admin
Msg>          When the problem is apparent, this will NOT COMMUNICATE at
Msg>          all with the local location broker. (ie cannot lookup, clean,
Msg>          etc ).
Msg>
Msg>     crp -on fred -me
Msg>          Doing this from the problem node fails with the same message
Msg>          about no more sockets.
Msg>
Msg> All I have to do is kill any one of the three print servers for things
Msg> to get better. So far, our sys admin people say that's what we should
Msg> do. (Not a solution to the problem, though.)
Msg>
Msg> When the node in question is 'sick', other nodes can use prf -list_pr
Msg> and see the printers which the sick node's print manager is managing.
Msg>
Msg> SOMETIMES taking the sick node down to the phase II shell and coming
Msg> up again will cure it. For a while.
Msg>
Msg> When the node is not sick, it will eventually become sick. No operator
Msg> intervention is required to bring on a bout of sickness!!!
Msg>
Msg>
Msg>
Msg> Questions:
Msg>
Msg> Is there a limit in DOMAIN/OS on the number of print servers that can
Msg> be run on a node? (And if so, WHY????)
Msg>
Msg> Is there a limit on the number of 'sockets' available for NCS type
Msg> services? (Again - if so why?) If there is a limit - can it be
Msg> configured in any way??????
Msg>
Msg> Has anybody else seen this? Should I report it as an APR or am I doing
Msg> something really stupid?
Msg>
Msg> This seems to indicate a fairly major problem in NCS - as if something
Msg> somewhere is using resources (sockets?), and not freeing them
Msg> afterwards. (Or maybe the old un-initialised variable trick?!).
Msg>
Msg> Maybe it gets cured in later releases? (I wait for the day we go up to
Msg> SR10.2 - only our Mentor stuff is holding us back).

    [end of E-mail message]
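
Since the sickness comes and goes on its own, it would help to have a
record of when each symptom appears. The following is only a sketch of
such a logger, written in Python for illustration (not something that
ran on our SR10.1 node). The function name check_commands is made up,
and it ASSUMES that each command exits with a non-zero status when it
fails; I have not verified that prf behaves that way.

```python
import subprocess
import time


def check_commands(commands, timeout=30):
    """Run each command once and record whether it succeeded.

    Returns a list of (timestamp, command_string, ok) tuples.
    ASSUMPTION: a failing command exits non-zero; a command that
    cannot be started, or that hangs past `timeout`, counts as failed.
    """
    results = []
    for cmd in commands:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            proc = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            ok = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            ok = False
        results.append((stamp, " ".join(cmd), ok))
    return results
```

Calling check_commands([["prf", "-list_pr"]]) every few minutes and
appending the tuples to a file would show when the node goes sick, which
could then be compared against which servers were running at the time.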

Since sending this off, I have done some more investigating. I started
with the sick node, and from another node, tried to CRP onto node
'sick' (not its real name - but I may as well protect the
innocent[?!]). I looked at how many remote processes could log in,
and found that as I killed processes on node 'sick', I could create
more remote processes before things died (ie went sick). As it looks a
lot like 'crp' uses NCS services, this seems fair enough.
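
The same "count how many succeed before it dies" idea can be tried
directly on sockets. What follows is a rough sketch in Python, assuming
a system with BSD-style sockets and UDP over IP on the loopback address.
Whether NCS on DOMAIN/OS draws on the same pool of sockets is exactly
what I do not know, so treat it as an illustration of the method only.
The name count_bindable_sockets and the max_probe cap are invented here.

```python
import socket


def count_bindable_sockets(max_probe=64, host="127.0.0.1"):
    """Bind UDP sockets to ephemeral ports until one fails.

    Holds every socket open while probing, then closes them all.
    Returns (count, error); error is None if max_probe was reached
    without any failure.
    """
    held = []
    error = None
    try:
        for _ in range(max_probe):
            try:
                s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            except OSError as exc:
                error = exc          # could not even create a socket
                break
            try:
                s.bind((host, 0))    # port 0 = let the system choose
            except OSError as exc:
                s.close()
                error = exc          # created, but could not bind
                break
            held.append(s)
    finally:
        for s in held:
            s.close()                # give everything back
    return len(held), error
```

If the count drops as more servers are started, and recovers when one is
killed, that would point to a shared, limited pool rather than to any
one server misbehaving.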

Then, I killed off process 'netman'. (Diskless node boot server I
think). Bingo. All came good. But after a re-start [=>phase 2 and
back again], things are their normal sick selves. (Netman is still
there).

It appears to me that NCS is simply running out of some limited
resource (sockets?). Also, don't blame my own home grown servers - I
killed them off and things can still get sick! (I also do not believe
that server processes which open mailboxes and wait on event counts
can really make things misbehave so badly - although one of them does
acquire a device - but that would just be too silly...)

So, does anybody have any suggestions / comments?
See questions above.
Will SR10.2 fix this?


Yours in frustration

Ashleigh Quick
AGQ@dstos3.dsto.oz.au