Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!munnari.oz.au!sirius.ucs.adelaide.edu.au!fang!itd0!agq
From: agq@itd0.dsto.oz (Ashley Quick)
Newsgroups: comp.sys.apollo
Subject: Re: create remote process
Message-ID: <1174@fang.dsto.oz>
Date: 10 Aug 90 01:34:54 GMT
Sender: news@fang.dsto.oz
Reply-To: agq@dstos3.dsto.oz (Ashleigh Quick)
Organization: Defence Science and Technology Organisation
Lines: 156
References:
Sender:
Followup-To:
Distribution:
Keywords:

My previous posting mentioned a big BIG problem. I have exchanged some
E-Mail with others about this, and am posting to get a wide coverage.

We have a node here running SR10.1. When we run three print servers,
strange things begin to happen... like "prf -list_pr" will fail with a
message "unable to locate printers for site xxxxx - unable to bind
socket".

When it is sick the /etc/ncs/lb_admin utility will not talk to the local
location broker (llbd), you cannot CRP off or on the node, etc.

Killing one of the print servers makes things a little better. Then you
may be able to CRP onto the node once or twice... after that CRP will
just die (if coming in from elsewhere), and trying to CRP out of the sick
node bombs with a similar error to that above.

Here is an edited version of what I have sent to Dave Krowitz, which
contains an edited version of some of his earlier suggestions....

Msg> Recently you mailed me with some info about our weird and wonderful
Msg> problem of 'no more sockets' (also known as 'cant bind socket').
Msg>
Msg> You sent:
Msg>
Msg> >
Msg> >My guess is that the problem is not with your pty's. NCS is a method of
Msg> [... more on ptys]
Msg> >
Msg> >If /etc/ping, ftp, telnet, rlogin, etc. work between the nodes in question,
Msg> >then your TCP services are probably OK. /etc/ping will tell you that *some*
Msg> [etc]
Msg>
Msg> TCP services are working OK. We run tcpd and inetd on every node in our
Msg> network. One central group does the administration, OS build/install, etc.
Msg> They are as fooled by this problem as anybody.
Msg>
Msg> >If your TCP services seem ok, then start checking your llbd's on the nodes
Msg> >in question, and the glbd's on all nodes in your network which run the global
Msg> >broker. /etc/ncs/llbd_admin and /etc/ncs/drm_admin are the tools to use for
Msg> >this. drm_admin will tell you if the global databases are out of synch and
Msg> >if the clocks on the nodes are different. Run it on each node in question
Msg> >and see that the list of glbd sites is the same on each node! (some nodes
Msg> >may only see a subset of all the glbd's that are supposed to be running).
Msg> >
Msg>
Msg> OK.
Msg> On our net we have 3 glbd's running. I have checked them. They all
Msg> know about each other, on the right nodes. The clocks are in sync to
Msg> within about 30 seconds. [Our sys admin people complained bitterly
Msg> about the crummy hardware which lets the clocks slip - when system
Msg> software depends on them being accurate.]
Msg>
Msg> I ran /etc/ncs/lb_admin on each of the nodes, and cleaned up the glb
Msg> and llb data bases. (Some of which did contain old/inaccurate
Msg> garbage.)
Msg>
Msg> The problem, after all of this, has not gone away. It only seems to
Msg> happen [be most apparent] when I have 3 prsvrs running.
Msg>
Msg> To recap: A DN4500, running SR10.1. We run three print servers (one is
Msg> a LaserJet with my own driver [which is incomplete - but works
Msg> enough], another is a line printer [via a National Instruments GPIB
Msg> port!!!!], and the third is a HP7550 plotter). This gives service for
Msg> a number of applications, including Mentor Graphics, and simulation
Msg> tools from Eesof. This node also runs the print manager (for this
Msg> "site"), as well as tcpd, inetd, llbd, glbd, spm, etc....
Msg>
Msg> The problem is not always apparent.
Msg> When it is there, I have noticed the following:
Msg>
Msg> prf -list_printers
Msg>     This will list each of the print manager sites in the
Msg>     network, with a message saying something like 'unable to
Msg>     locate printers for site xxxxxx - unable to bind socket'. (Or
Msg>     was it '... - no more free sockets'?)
Msg>
Msg> /etc/ncs/lb_admin
Msg>     When the problem is apparent, this will NOT COMMUNICATE at
Msg>     all with the local location broker (i.e. cannot lookup, clean,
Msg>     etc.).
Msg>
Msg> crp -on fred -me
Msg>     Doing this from the problem node fails with the same message
Msg>     about no more sockets.
Msg>
Msg> All I need to do is kill any one of the three print servers for
Msg> things to get better. So far, our sys admin say that's what we should
Msg> do. (Not a solution to the problem, though.)
Msg>
Msg> When the node in question is 'sick', other nodes can use prf -list_pr
Msg> and see the printers which the sick node's print manager is managing.
Msg>
Msg> SOMETIMES taking the sick node down to the phase II shell and coming
Msg> up again will cure it. For a while.
Msg>
Msg> When the node is not sick, it will eventually become sick. No operator
Msg> intervention is required to bring on a bout of sickness!!!
Msg>
Msg>
Msg>
Msg> Questions:
Msg>
Msg> Is there a limit in DOMAIN/OS on the number of print servers that can
Msg> be run on a node? (And if so, WHY????)
Msg>
Msg> Is there a limit on the number of 'sockets' available for NCS type
Msg> services? (Again - if so why?) If there is a limit - can it be
Msg> configured in any way??????
Msg>
Msg> Has anybody else seen this? Should I report it as an APR or am I doing
Msg> something really stupid?
Msg>
Msg> This seems to indicate a fairly major problem in NCS - as if something
Msg> somewhere is using resources (sockets?), and not freeing them
Msg> afterwards. (Or maybe the old un-initialised variable trick?!).
Msg>
Msg> Maybe it gets cured in later releases?
Msg> (I wait for the day we go up to SR10.2 - only our Mentor stuff is
Msg> holding us back).

[end of E-mail message]

Since sending this off, I have done some more investigating.

I started with the sick node, and from another node, tried to CRP onto
node 'sick' (not its real name - but I may as well protect the
innocent[?!]). I looked at how many remote processes could log in, and
found that as I killed processes on node 'sick', I could create more
remote processes before things died (i.e. went sick). As it looks a lot
like 'crp' uses NCS services, this seems fair enough.

Then, I killed off process 'netman'. (Diskless node boot server, I
think.) Bingo. All came good.

But after a re-start [=>phase 2 and back again], things are their normal
sick selves. (Netman is still there.)

It appears to me that there is some kind of limitation brought about by
NCS services just running out.

Also, don't blame my own home grown servers - I killed them off and
things can still get sick! (I also do not believe that server processes
which open mailboxes and wait on event counts can really make things
misbehave so badly - although it does acquire a device - but that would
just be too silly...)

So, does anybody have any suggestions / comments? See questions above.
Will SR10.2 fix this?

Yours in frustration

Ashleigh Quick
AGQ@dstos3.dsto.oz.au