Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!columbia!rutgers!caip!brl-adm!brl-smoke!smoke!narten@purdue.arpa
From: narten@purdue.arpa (Thomas Narten)
Newsgroups: net.unix-wizards
Subject: BSD Unix machines hanging
Message-ID: <4362@brl-smoke.ARPA>
Date: Sat, 4-Oct-86 14:18:01 EDT
Article-I.D.: brl-smok.4362
Posted: Sat Oct 4 14:18:01 1986
Date-Received: Tue, 7-Oct-86 20:38:27 EDT
Sender: news@brl-smoke.ARPA
Lines: 39

We have been experiencing a rather odd and intermittent problem with our Unix machines. It is not confined to a particular machine or Unix; it has happened with 4.2, 4.2 NFS, and 4.3 BSD on VAX 780, 785 and uVAX II machines.

Symptoms: The machines appear to lock up, users cannot get characters echoed, console is hung. In short, the machine seems dead. The only way to recover is a reboot. However, the machine is still running in a sense. One can ping the machine in question, and it responds. One can open a TCP connection to the machine, and the connection succeeds, but hangs at that point.

When this happens, we have halted the cpu, looked at the PC, and continued the system, repeating this in hopes of finding the machine caught in a tight loop somewhere. It is not in a tight loop. In fact, when this nailed one of our idle machines, the system was spending all of its time in the context switch routine "Swtch". Other attempts at this have found the PC in unrelated procedures on each halt.

This has hit most of our machines at one time or another, but usually only gets one at a time. Sometimes it's a month between hangs, sometimes several times in a day.

I suspect that we are tweaking some sort of networking bug where the setting of the processor priority level gets messed up, leaving the machine at a higher priority than it should be, so that user processes are no longer scheduled. Evidence to support this is an increase in network traffic on our Ethernets over the last 6 months.
Also, the last time one of the machines hung, the last message on the console was a "qe0: restart" message, indicating that the DEQNA Ethernet board had become wedged. The problem is not restricted to machines with a DEQNA.

Has anyone else run into a similar problem?

Thomas