Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!uw-beaver!rice!sun-spots-request
From: dan@watson.bbn.com (Dan Franklin)
Newsgroups: comp.sys.sun
Subject: Sun-4 severe NFS problem
Keywords: Networks
Message-ID: <8903212158.AA22971@rice.edu>
Date: 30 Mar 89 20:44:18 GMT
Sender: usenet@rice.edu
Organization: Sun-Spots
Lines: 56
Approved: Sun-Spots@rice.edu
Original-Date: Tue, 21 Mar 89 16:57:08 EST
X-Sun-Spots-Digest: Volume 7, Issue 216, message 8 of 13

We're having severe NFS problems involving our (only) Sun 4/110, running
SunOS 4.0.1. The symptom is that a process attempting to copy (cp) a
"large" file (larger than roughly 1k to 2k bytes) between this machine
and any of several others will almost always hang. The other machines
include a Sun-3/160 (SunOS 3.4), a MicroVAX (Ultrix 2.3), and our
diskless Sun-3/50 machines (SunOS 3.4).

We've seen the problem while copying:

1) from a Sun-4 directory to one on the Sun-3/160, while on the Sun-4,
2) from a VAX directory to one on the Sun-4, while on the Sun-4,
3) from a Sun-4 directory to one on the Sun-3/160, while on a Sun-3/50.

We can copy tiny files without any problem. But we get long delays when
copying larger files, ranging up to a delay of infinity :-)

Most of our experiments copied between Suns (cases 1 and 3). The
definition of "large" is not constant, but the threshold seems to lie
between 1k and 2k bytes. With files larger than that, the cp hangs.
Occasionally it returns after a long delay, but usually it does not
return at all until it is interrupted. Generally we also get an
accompanying "NFS server not responding still trying" error message.

The trace command reveals that the cp hangs in a variety of places:
doing a "stat" on the destination directory, writing to the destination
file, or closing it, but always in an operation involving the
destination.

While a cp is hung, all of the machines involved in the cp operation
continue to respond to other commands, including other NFS commands.
However, on the initiating machine, you cannot access the directory
containing the file being cp'd. For example, in case 1, an "ls", on the
Sun-4, of the remote directory containing the file being copied will
also hang. But you can look at that file on the serving machine, as well
as on machines other than the Sun-4 that have that directory mounted.
Other network services, including FTP and rlogin, work perfectly.

These symptoms seem to be quite different from those discussed in other
Sun-4 hanging situations. No nfsd ever ends up in a permanent "D" wait
state on any of the machines, including the Sun-4. Unrelated NFS
activities on the two machines in question work fine.

Our problem sounded a little like the interrupt priority bug discussed
by Charles Hedrick recently, so I tried raising the priority of splnet()
to 2 and then to 3 by patching the kernel according to his instructions.
It didn't help.

Naturally, we've called the Sun Hotline. They said they'd call back in a
few hours; so far it's been two days with no response.

This situation renders our brand new Sun-4 completely useless for the
purpose we bought it for. We desperately need to get it to work. Any
suggestions, hints, things to try, wild guesses, etc. will be gratefully
received.

	Dan Franklin
	dfranklin@bbn.com or dan@bbn.com