Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!strath-cs!baird!jim
From: jim@cs.strath.ac.uk (Jim Reid)
Newsgroups: comp.protocols.nfs
Subject: Re: SCO Unix problem with large executables over NFS?
Message-ID:
Date: 18 Sep 90 13:32:55 GMT
References: <15814@know.pws.bull.com>
Sender: jim@cs.strath.ac.uk
Organization: Computer Science Dept., Strathclyde Univ., Glasgow, Scotland.
Lines: 59
In-reply-to: eli's message of 13 Sep 90 13:08:01 GMT

In article <15814@know.pws.bull.com> eli (Steve Elias) writes:

> can anyone confirm or deny the following as a bug or feature?
>
> when we run large executables off of an NFS mounted drive under
> SCO Unix, the process will sometimes die off randomly, occasionally
> reporting that it has received a kill signal. we've seen this
> behavior with both gnu emacs and a large document prep system.
>
> theory #1. ahem. ahem. theory #1, which is mine (ours):
> when pagedaemon (or appropriate kernel portion) tries to page in
> some requested text pages across NFS, and the network drops
> packet(s), pagedaemon sends a kill signal to the process which
> needs the text page.
>
> restated: something causes the process to die ungracefully if it
> can't get its requested page fast enough across NFS. perhaps there
> is some sort of "retry" parameter which can be adjusted.
>
> i've never seen this behavior on either HPs or Suns running
> executables across NFS, so i doubt this is supposed to be happening.
>
> can any of yall confirm or deny this behavior and/or theory?

You're on the right lines, but not quite correct. If an NFS client
pages across the network, it will be suspended until the NFS action
completes. It will not continue executing until the data is read from
the server or written to the server and an NFS reply returned.
(There's no question of not getting the page "fast enough". It has to
wait until the page arrives.
What can be a problem is the client and server dropping too many
packets because of a mismatch in the throughput of the ethernet
interfaces and protocol handling code.)

In the case of paging in, the underlying transport protocol (UDP) will
put together the NFS "packet" from the server before handing it off to
NFS and then back to the suspended user process. If your network drops
packets, the client won't be able to re-assemble the data, so the
server retransmits the data. [To be more precise, the client
retransmits the same request and the server sends the data again.]
Eventually the client gets all the data it had asked for and the
kernel returns the page(s) to the waiting process.

Problems arise if the filesystem is soft mounted. If it was hard
mounted, clients and servers retransmit forever until success is
achieved. If soft mounted, NFS can return an error after some number
of retries and/or a timeout limit have been reached. This 'cannot
happen': it's akin to getting an error from a disk read or write
request. In theory, this should send a signal to the process which
causes it to terminate (a swap error has occurred). Some NFS
implementations apparently silently ignore the error and return a page
of null bytes to the user process! This may cause an immediate core
dump - illegal instruction or a segmentation violation. If you're
unlucky, the process gets a page of null data and doesn't realise it
until some time later.

In short, the answer is to hard mount your filesystems. It is a good
idea to do this anyway. Soft mounts don't buy you any worthwhile
advantages and can cause a lot of unpredictable trouble.

		Jim
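
The difference between hard and soft mounts described above can be
illustrated with a small toy model in Python. This is only a sketch of
the retry semantics, not how any real NFS client is implemented: the
names read_page, flaky_server and NFSTimeout and the retry limit of 3
are all hypothetical, chosen for illustration.

```python
class NFSTimeout(Exception):
    """Hypothetical error: a soft-mounted read gave up retrying."""


def read_page(fetch, hard=True, retrans=3):
    """Toy model of a client waiting for a page from the server.

    fetch   -- callable returning the page bytes, or None if the
               reply was lost and the request must be retransmitted
    hard    -- True: retransmit until the server answers
               False: give up after `retrans` attempts (soft mount)
    retrans -- retry limit for the soft case (illustrative value)
    """
    attempts = 0
    while True:
        attempts += 1
        page = fetch()
        if page is not None:
            return page  # the waiting process can now continue
        if not hard and attempts >= retrans:
            # Soft mount: the error is handed back to the caller.
            raise NFSTimeout(f"gave up after {attempts} attempts")
        # Hard mount (or retries remaining): send the request again.


def flaky_server(drops, page=b"text page"):
    """Return a fetch() that loses the first `drops` replies."""
    state = {"left": drops}

    def fetch():
        if state["left"] > 0:
            state["left"] -= 1
            return None
        return page

    return fetch
```

With five lost replies, read_page(flaky_server(5), hard=True) keeps
retrying and eventually returns the page, while the same call with
hard=False and retrans=3 raises NFSTimeout. What the caller then does
with that error is exactly where the trouble described above begins.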