Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!usc!zaphod.mps.ohio-state.edu!rpi!batcomputer!cornell!ken
From: ken@gvax.cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: Re: startup failure on SGI
Message-ID: <47943@cornell.UUCP>
Date: 5 Nov 90 00:17:40 GMT
References: <125374@linus.mitre.org>
Sender: nobody@cornell.UUCP
Reply-To: ken@gvax.cs.cornell.edu (Ken Birman)
Distribution: usa
Organization: Cornell Univ. CS Dept, Ithaca NY
Lines: 72

In article <125374@linus.mitre.org> pop@linus.mitre.org (Paul Perry) writes:
>
>I compiled ISIS with BYPASS on an SGI 4D/220VGX (IRIX System V Release
>3.3.1 uci4).  It compiled well (quite fast!), but failed on the
>startup sequence.  I think I have all the appropriate config files
>right (sites, and /etc/services) but I don't know how to interpret the
>log files.  Could someone take a look ?  Attached are the output of
>the startup sequence, and the log file.  Thanks, Paul.

FYI, we have a mailing address "isis-bugs@cs.cornell.edu" that works
better than comp.sys.isis for this sort of problem.

There seem to be several problems here, the first being that at Cornell
we have not worked directly with ISIS on an SGI workstation. Someone
provided us with the port and said that it works, but the evidence is
that it has a problem, as discussed below. The problem may be
configuration specific, and it doesn't relate to BYPASS/non-BYPASS
(things didn't get far enough for that to matter).

The problems I see, in order, are:

1) Your isis "sites" file is formatted incorrectly (it lacks the final
"scope" information field). Apparently, you used lines like

    14: 1234,1235,1236 uci4.mitre.org

rather than

    14: 1234,1235,1236 uci4.mitre.org mitre

or even

    14: 1234,1235,1236 uci4.mitre.org mitre,sgi

(look this up if you are unclear on what I am talking about)

2) Your system naming service is not able to resolve full names in any case.
The lines reading "Ignoring site uci11.mitre.org" appear because
gethostbyname("uci11.mitre.org") is failing on your machine. Ask an
administrator... probably something is wrong with /etc/hosts.

3) The real problem, the one that caused the crash, is that "bin/isis"
is unable to exchange messages with "bin/protos". In principle, this is
done in the following steps:

3-a) Because UNIX_DOM is enabled in pr.h and isis.h, when bin/isis first
runs protos, protos creates a unix-domain socket named "/tmp/Is1235",
using the second ("tcp") entry from the sites file you made. Trouble
with permission to create this socket (or a badly chosen umask that
disables write permission, for example) could make it inaccessible to
other programs... something of this sort is probably responsible.

3-b) bin/isis tries to "connect" (see connect(2)) to this socket. The
sequence is that it creates a socket of its own, gives it a name based
on its own process-id (e.g. /tmp/Cl5432), and issues a connect system
call. This fails, which is not normal, and so you get a message about
an unexpectedly slow protos startup. After a few retries, bin/isis
gives up.

3-c) Meanwhile, bin/protos is getting antsy -- why hasn't bin/isis
connected yet? After a while it times out and does a panic-exit. This
always prints the sort of dump you saw. This time, the dump is quite
boring.

What to do about this?

1) Fix the sites and /etc/hosts files; who knows, perhaps this is
related.

2) Run "ls -l /tmp" during the first 15-20 seconds after startup to see
what the story is with the sockets.

3) If all else fails, consider changing isis.h and pr.h to NOT set
UNIX_DOM (i.e. just leave the "#define UNIX_DOM 1" out) and recompile
all the system binaries. This has a good chance of working.

Since we are in a posting loop, I guess it might not hurt to post the
explanation once the thing is working...

Ken Birman
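
The first two checks above (a scope field on every sites line, and a host name that gethostbyname can resolve) can be sketched as a small script. This is only an illustration and not part of ISIS: the function names are hypothetical, and it assumes a sites line looks exactly like the examples in this post (site number, comma-separated ports, host name, optional scope list). The socket path follows the "/tmp/Is" plus second-port naming described in step 3-a.

```python
import socket


def parse_sites_line(line):
    """Split one sites line into (site_no, ports, host, scopes).

    Assumed layout, taken from the examples in the post:
        14: 1234,1235,1236 uci4.mitre.org mitre,sgi
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("too few fields in sites line: %r" % line)
    site_no = int(fields[0].rstrip(":"))
    ports = [int(p) for p in fields[1].split(",")]
    host = fields[2]
    # The scope field is the one reported missing in the post.
    scopes = fields[3].split(",") if len(fields) > 3 else []
    return site_no, ports, host, scopes


def check_sites_line(line, resolve=socket.gethostbyname):
    """Return a list of problems found in one sites line."""
    problems = []
    _, _, host, scopes = parse_sites_line(line)
    if not scopes:
        problems.append("missing scope field")
    try:
        resolve(host)  # same lookup that fails in problem 2
    except OSError:
        problems.append("cannot resolve " + host)
    return problems


def protos_socket_path(line):
    """Name of the unix-domain socket protos would create (step 3-a)."""
    _, ports, _, _ = parse_sites_line(line)
    return "/tmp/Is%d" % ports[1]  # second port entry


if __name__ == "__main__":
    line = "14: 1234,1235,1236 uci4.mitre.org"
    print(check_sites_line(line))
    print(protos_socket_path(line))
```

The resolver is passed in as a parameter so the line-format check can be exercised on a machine whose name service is itself broken.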