Path: utzoo!attcan!uunet!wuarchive!cs.utexas.edu!rutgers!news-server.csri.toronto.edu!utgpu!cunews!news
From: holtz@zonker.cascade.carleton.ca (Neal Holtz)
Newsgroups: comp.sys.apollo
Subject: YARR (Yet Another Registry Rot)
Message-ID: <1990Oct9.162600.16610@ccs.carleton.ca>
Date: 9 Oct 90 16:26:00 GMT
Sender: news@ccs.carleton.ca (news)
Organization: none
Lines: 98

I am sure this question has been asked (and answered) before, but I
can't find it in my archives...

Configuration:

    1 - DN4500, disked
    3 - DN3500s, disked
    4 - DN2500s, diskless - one booted off each disked node
    SR10.2 Aegis & BSD
    1 Master registry, on DN4500, no replicas
    DN4500 runs glbd, llbd, rgyd
    each DN3500 runs llbd

Problem:

    With time, the registry server becomes unavailable on the DN3500s
    and users are logged in using the local registries. This takes the
    form of a gradual rot, with more and more of the DN3500s becoming
    'serverless'. Also, of course, '/etc/passwd' is unreadable on those
    nodes (doesn't exist).

    However, the registries are still available to the diskless DN2500s
    booted off the DN3500s, and /etc/passwd is OK there. And the
    affected DN3500 has no trouble seeing the files on, and otherwise
    communicating with, the rgy server node.

Attempted fixes:

    We have rebooted everything, we have manually restarted various
    servers, and we have looked at a few log files in /usr/adm. Nothing
    worked, and we found no clues, either. Debugging via Apollo's
    hotline seems to take a long time, as well. The clocks are set to
    within a few seconds of each other.

Details:

Processes on the Master Registry node (DN4500):

1 > ps -ax
  PID TTY STAT    TIME COMMAND
    1 ?   S <     0:23 /etc/init
    2 ?   R    1461:10 null
    3 ?   S       0:51 purifier
    4 ?   S       0:29 purifier
    5 ?   S       0:50 unwired_dxm
    6 ?   S       0:00 pinger
    7 ?   S       0:04 netreceive
    8 ?   S       0:39 netpaging
    9 ?   S       0:18 wired_dxm
   10 ?   S       0:38 netrequest
   91 ?   S       5:10 /etc/tcpd
   96 ?   S       1:25 /etc/routed -f -q
   99 ?   S       0:00 /etc/inetd
  102 ?   S       0:00 /etc/ncs/llbd
  104 ?   S       1:49 /etc/ncs/glbd
  107 ?   S <     2:02 /etc/rgyd
  112 ?   S       0:04 /sys/spm/spm
  115 ?   S       0:03 /sys/net/netman
  117 ?   S       0:03 /sys/ns/ns_helper
  120 ?   S       0:08 /sys/alarm/alarm_server -disk 98 -msg -w 0 0 550 100 -
  122 ?   S       0:01 /sys/mbx/mbx_helper
  125 ?   S <     0:08 /etc/Xapollo -K /usr/X11/lib/keyboard/keyboard.config
  127 ?   S <     8:29 dm

Processes on the serverless DN3500:

Connected to node 19A9D "//thorin"

login:
Password:
Using local registry.
Can't use network registry:
    - Registry server unavailable (from RGYC / Server)

1 > ps -ax
  PID TTY STAT    TIME COMMAND
    1 ?   S <     0:28 /etc/init
    2 ?   R     158:32 null
    3 ?   S       0:05 purifier
    4 ?   S       0:00 purifier
    5 ?   S       0:09 unwired_dxm
    6 ?   S       0:00 pinger
    7 ?   S       0:00 netreceive
    8 ?   S       0:23 netpaging
    9 ?   S       0:02 wired_dxm
   10 ?   S       0:13 netrequest
   92 ?   S       0:00 /etc/ncs/llbd
   97 ?   S       0:01 /sys/spm/spm
   99 ?   S       0:03 /sys/net/netman
  101 ?   S       0:02 /sys/alarm/alarm_server -disk 98 -msg -w 0 0 550 100 -v 20 20
  105 ?   S       0:00 /sys/mbx/mbx_helper
  107 ?   S <     0:05 /etc/Xapollo -K /usr/X11/lib/keyboard/keyboard.config -D1 s+r-
  109 ?   S <     4:09 dm

Perhaps I'll dig out my SR8 floppies and re-install :-(

-- 
Prof. Neal Holtz, Dept. of Civil Eng., Carleton University, Ottawa, Canada
Internet: holtz@civeng.carleton.ca
Tel: (613)788-5797  Fax: (613)788-3951