Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!strath-cs!baird!jim
From: jim@cs.strath.ac.uk (Jim Reid)
Newsgroups: comp.protocols.nfs
Subject: Re: SCO Unix problem with large executables over NFS?
Message-ID: <JIM.90Sep18143255@baird.cs.strath.ac.uk>
Date: 18 Sep 90 13:32:55 GMT
References: <15814@know.pws.bull.com>
Sender: jim@cs.strath.ac.uk
Organization: Computer Science Dept., Strathclyde Univ., Glasgow, Scotland.
Lines: 59
In-reply-to: eli's message of 13 Sep 90 13:08:01 GMT

In article <15814@know.pws.bull.com> eli (Steve Elias) writes:

   can anyone confirm or deny the following as a bug or feature?

   when we run large executables off of an NFS mounted drive under SCO
   Unix, the process will sometimes die off randomly, occasionally
   reporting that it has received a kill signal.  we've seen this
   behavior with both gnu emacs and a large document prep system.

   theory #1.  ahem.  ahem.  theory #1, which is mine (ours):

   when pagedaemon (or appropriate kernel portion) tries to page in some
   requested text pages across NFS, and the network drops packet(s),
   pagedaemon sends a kill signal to the process which needs the text
   page.  restated: something causes the process to die ungracefully if
   it can't get its requested page fast enough across NFS.  perhaps there
   is some sort of "retry" parameter which can be adjusted.  i've never
   seen this behavior on either HPs or Suns running executables across
   NFS, so i doubt this is supposed to be happening.

   can any of yall confirm or deny this behavior and/or theory?

You're on the right lines, but not quite correct.

If a process on an NFS client pages across the network, it is
suspended until the NFS operation completes. It does not continue
executing until the data has been read from or written to the server
and an NFS reply returned. (There's no question of not getting the
page "fast enough": the process has to wait until the page arrives.
What can be a problem is the client and server dropping too many
packets because of a mismatch in the throughput of their ethernet
interfaces and protocol handling code.)

In the case of paging in, the server's reply is a single UDP datagram
that is usually too big for one ethernet packet, so it travels as
several IP fragments. The client's IP layer puts the datagram back
together before handing it to UDP, then to NFS and finally back to
the suspended user process. If your network drops any of those
fragments, the client can't re-assemble the reply. Its request times
out, it retransmits the same request and the server sends the data
again. Eventually the client gets all the data it had asked for and
the kernel returns the page(s) to the waiting process.
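
As a rough illustration of why a few dropped packets hurt so much
here, the sketch below counts the IP fragments in one read reply and
the chance that at least one of them goes missing. The numbers are
assumptions, not measurements: an 8192-byte read, a 1500-byte ethernet
MTU, a 20-byte IP header with no options, and independent losses. RPC
and NFS header bytes are ignored.

```python
import math

def fragments_needed(payload_bytes, mtu=1500, ip_header=20, udp_header=8):
    """IP fragments needed for one UDP datagram (assumed header sizes)."""
    # Data carried by each fragment except the last must be a multiple of 8.
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil((payload_bytes + udp_header) / per_fragment)

def datagram_loss(fragment_loss, n_fragments):
    """Chance that at least one fragment is lost, assuming independence."""
    return 1.0 - (1.0 - fragment_loss) ** n_fragments

n = fragments_needed(8192)
loss = datagram_loss(0.01, n)
print(n, round(loss, 4))
```

Under these assumptions a 1% fragment loss rate becomes roughly a 6%
loss rate for whole replies, since one missing fragment throws away
the entire datagram.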

Problems arise if the filesystem is soft mounted. If it is hard
mounted, the client retransmits its request (and the server its
reply) forever until success is achieved. If it is soft mounted, NFS
can return an error once some number of retries and/or a timeout
limit has been reached. As far as the paging code is concerned, this
'cannot happen': it's akin to getting an error from a disk read or
write request. In theory, the kernel should send the process a signal
which causes it to terminate (a swap error has occurred). Some NFS
implementations apparently silently ignore the error and return a page
of null bytes to the user process! This may cause an immediate core
dump: illegal instruction or a segmentation violation. If you're
unlucky, the process gets a page of null data and doesn't realise it
until some time later.
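
The difference between the two mount types can be sketched as a retry
loop. This is a toy model, not any vendor's kernel code: read_page,
flaky_server and NfsTimeout are made-up names, and the retrans
parameter is only modelled on the mount option of that name found on
many systems.

```python
class NfsTimeout(Exception):
    """Stands in for the I/O error a soft mount returns when it gives up."""

def read_page(send_request, hard, retrans=3):
    """Repeat the read request until a reply arrives.

    send_request() returns the page bytes, or None when the request or
    its reply was lost. hard=True retries forever; hard=False gives up
    after `retrans` attempts and raises NfsTimeout.
    """
    attempts = 0
    while True:
        attempts += 1
        reply = send_request()
        if reply is not None:
            return reply
        if not hard and attempts >= retrans:
            raise NfsTimeout("no reply after %d attempts" % attempts)

def flaky_server(drops, page=b"\x90" * 4096):
    """Hypothetical server whose first `drops` replies are lost."""
    state = {"left": drops}
    def send_request():
        if state["left"] > 0:
            state["left"] -= 1
            return None
        return page
    return send_request

# Hard mount: five lost replies only delay the page.
page = read_page(flaky_server(5), hard=True)

# Soft mount: the same five losses exhaust the retries and surface as
# an error that the paging code has to handle somehow.
try:
    read_page(flaky_server(5), hard=False, retrans=3)
    soft_failed = False
except NfsTimeout:
    soft_failed = True
```

With a hard mount the caller only ever sees a delay. With a soft mount
the error has to go somewhere, and a kernel that turns it into a page
of nulls gives exactly the random deaths described above.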

In short, the answer is to hard mount your filesystems. It is a good
idea to do this anyway. Soft mounts don't buy you any worthwhile
advantages and can cause a lot of unpredictable trouble.

		Jim