Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!bbn!mit-eddie!uw-beaver!rice!sun-spots-request
From: mjk@fluffy.rice.edu (Mark J. Kilgard)
Newsgroups: comp.sys.sun
Subject: Re: Daemons stuck in 'D' "short-term" wait state
Message-ID: <8902201648.AA04407@fluffy.rice.edu>
Date: 2 Mar 89 07:10:19 GMT
Sender: usenet@rice.edu
Organization: Sun-Spots
Lines: 45
Approved: Sun-Spots@rice.edu
Original-Date: Mon, 20 Feb 89 10:48:37 CST
X-Sun-Spots-Digest: Volume 7, Issue 178, message 2 of 13

We recently experienced a problem similar to the one reported by Rob
McMahon (v7n148) and, more recently, Dwight Ernest (v7n161).

We started getting NFS file server not responding errors from one of
our file servers. When I logged into the file server and did a ps, it
showed all the nfsd's in the 'D' "short-term" wait state. Logically the
file server worked fine, but all its clients hung when they tried to
access it over NFS. It was impossible to kill the nfsd's, and attempts
to start a new set failed. We were running 8 nfsd's, by the way, but I
don't think that is important.

We rebooted the machine with 'reboot' and the fsck failed. We did a
manual fsck and 'fix'ed 5-10 inconsistencies. We changed the
configuration to bring the machine up with only 4 nfsd's and brought
the machine up multi-user.

Everything was fine until about 27 hours later, when those 4 nfsd's
were again found in the D state. The machine was 'reboot'ed and again
the fsck failed. With some inspection by John Deuel, an anomaly was
found in the /barn/lost+found directory, where the fsck had complained.
Link counts were all messed up and it appeared that there were two
copies of the lost+found inode??? John clri'ed the lost+found inode and
did an fsck to fix the resulting mess.

The machine was rebooted and has been running for 36 hours now. It is
running with 8 nfsd's presently. There don't seem to be any more
problems.

It seems reasonable to think that the nfsd's might get confused by
anomalies in the file system and hang in a D state. Or could it be that
the nfsd's got screwed up and possibly created the anomaly? I can't
explain the cause of the initial fsck problems - the system had been
running for nearly a week without down time before the occurrence. It
seemed that the first fsck didn't fix the anomaly (or maybe it just
reappeared?). Perhaps a small glitch in fsck?

Have people had similar experiences? If so, what did you guess the
cause to be? Were there fsck problems before?

- Mark