Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!sdd.hp.com!elroy.jpl.nasa.gov!ncar!gatech!utkcs2!chili.cs.utk.edu!moore
From: moore@chili.cs.utk.edu (Keith Moore)
Newsgroups: comp.protocols.nfs
Subject: Is there a reliable way to write a file to an NFS server?
Message-ID: <1991Mar25.233245.15209@cs.utk.edu>
Date: 25 Mar 91 23:32:45 GMT
Sender: usenet@cs.utk.edu (USENET News Poster)
Reply-To: moore@cs.utk.edu
Organization: Univ. of Tenn. Computer Science, Knoxville
Lines: 80

Background:

I'm working on a way to distribute electronic mail delivery, in order
to make it more reliable.  Currently we have about 100 machines of
various sizes that all mount /var/spool/mail from one place via NFS.
One consequence of this is that the mail server is a single
point-of-failure -- if it goes down, mail delivery stops for everyone.
Even if it's only down for an hour or two, this is annoying -- our
users expect e-mail to be as reliable as the telephone.  (They don't
lose messages; they are just annoyed that the messages aren't
delivered immediately.)

My current idea for a mail delivery scheme replaces the
/var/spool/mail/$user mailbox file with a MESSAGES directory in each
user's home directory.  One or more mail servers (each of which has an
Internet MX record pointing to it) will access a recipient's MESSAGES
directory via NFS, and write a uniquely-named file in that directory
for each message delivered.  We will modify our mail user agents to
read files from this directory rather than from /var/spool/mail/$user.
(A similar scheme is used by CMU's Andrew Message Delivery System,
which is normally layered on top of the Andrew File System.)
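
As a rough illustration of "uniquely-named", here is one way a
delivery agent might build a file name.  This is only a sketch: the
function name make_message_name() and the timestamp.pid.sequence.host
layout are assumptions made for the example, not part of any existing
mailer.  It relies on each mail server having a distinct hostname.

```c
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: build a message file name of the form
 * <seconds>.<pid>.<sequence>.<hostname>.  The sequence counter keeps
 * names distinct within one process; pid and hostname keep them
 * distinct across processes and servers (assuming unique hostnames).
 * Returns 0 on success, -1 if the name does not fit or on error.
 */
int make_message_name(char *out, size_t outlen)
{
    static unsigned long seq = 0;
    char host[256];
    int n;

    if (gethostname(host, sizeof host) != 0)
        return -1;
    host[sizeof host - 1] = '\0';   /* gethostname may not terminate */

    n = snprintf(out, outlen, "%ld.%ld.%lu.%s",
                 (long)time(NULL), (long)getpid(), seq++, host);
    if (n < 0 || (size_t)n >= outlen)
        return -1;                  /* truncated or failed */
    return 0;
}
```

The caller would append this name to the path of the recipient's
MESSAGES directory before creating the file.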

There are several reasons for delivering the mail on the same file
system as the user's files.  The most important, but least obvious,
reason is that the user's file system must be present anyway before
mail can be delivered, otherwise the user's .forward file might be
ignored.  Delivery to the user's file system thus minimizes the
probability of failure: if the user needs access to both his own files
and to his incoming mail messages in order to work effectively, it's
best if they fail at the same time rather than independently (assuming
the same failure rate for both).

The Problem:

Mail delivery has to be absolutely reliable.  We would like to avoid
delays when possible, and under no circumstances is it acceptable for
the mail system to lose a message during transfer or delivery.

Our NFS load is distributed over several file servers which are used
almost exclusively for NFS access.  Under normal conditions, response
for NFS clients is quite good.  But during periods of peak load,
response degrades to the point at which a single NFS file operation
may require on the order of a minute to complete.  Under these
conditions, I have seen two instances within the last week in which
a file created on a server with the normal creat(), write(), ...,
close() sequence ended up with zero length and no indication of error,
even though the return values of all system calls were checked.  I
have observed this with both "real" applications (like vi) and a
simple C program I wrote.  My suspicion is that the initial NFS
create RPC was duplicated while waiting for the server to respond, and
the server performed one (or more) of the duplicates after the file
was written, truncating it.
My workstation is running SunOS 4.1.1, and the server 4.1, so these
are fairly recent implementations of the software.
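
For reference, the kind of sequence meant above looks roughly like the
following.  This is a sketch, not the actual test program; the names
write_all() and write_file_checked() are made up for the example.  The
fsync() call is an addition on the assumption that an NFS client may
report a deferred write error only at fsync() or close().  Note that a
sequence like this can return success and still leave a zero-length
file if the failure described above occurs.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: creat(), write(), fsync(), close(), checking
 * every return value.  Returns 0 on success, -1 with errno set.
 */
int write_file_checked(const char *path, const char *buf, size_t len)
{
    int fd = creat(path, 0600);
    if (fd < 0)
        return -1;

    if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
        int saved = errno;          /* keep the first error */
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);               /* close() can also report errors */
}
```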

Even though we could tune our NFS clients, servers, and workload to
avoid these problems to some degree (not that we haven't already done
this), this would only push back the knee of the curve, instead of
eliminating the problem entirely.  Unfortunately, this is not good
enough.  I realize that it will always be possible for the load on a
server to be so high that it cannot service any more requests, but I
would at least expect an error indication in this case.  My mail
delivery agent already has to cope with various temporary failures
(like when the user's file server is down), and this works fine.

So my question to this newsgroup is:

Is there any way I can reliably create a file on an NFS mounted
directory, write its contents, close it, and know whether the file was
written correctly and completely?

The best idea I've had so far is to issue the creat() syscall, then
wait several seconds (time t) to make sure any duplicate NFS create
RPCs have been processed by the server, then to write out the file and
close it (checking return codes, of course).  This might work if I can
come up with reasonable bounds for t.  (How long can one of these
things stay in a server's input queue, anyway?)
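
In code, the idea is roughly the following.  This is only a sketch
under stated assumptions: deliver_with_delay() and its delay_secs
parameter (the time t above) are hypothetical, and the stat() check
after close() is an extra step added for illustration.  That check is
not conclusive: the client may answer stat() from cached attributes,
and a duplicate create that reaches the server after the check would
still go unnoticed.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create the file, wait delay_secs seconds so
 * that any duplicate create RPCs can drain, then write, flush, close,
 * and compare the resulting size with len.
 * Returns 0 on success, -1 with errno set on any detected failure.
 */
int deliver_with_delay(const char *path, const char *buf, size_t len,
                       unsigned delay_secs)
{
    struct stat st;
    size_t done = 0;
    int fd = creat(path, 0600);

    if (fd < 0)
        return -1;

    sleep(delay_secs);              /* time t; a guess, not a bound */

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        done += (size_t)n;
    }
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0)
        return -1;

    /* Best-effort check: does the file have the expected length? */
    if (stat(path, &st) < 0)
        return -1;
    if ((size_t)st.st_size != len) {
        errno = EIO;
        return -1;
    }
    return 0;

fail:
    {
        int saved = errno;          /* keep the first error */
        close(fd);
        errno = saved;
    }
    return -1;
}
```

A caller that gets -1 back would treat the delivery as a temporary
failure and retry later, like any other temporary failure.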

--
Keith Moore / U.Tenn CS Dept / 107 Ayres Hall / Knoxville TN  37996-1301
Internet: moore@cs.utk.edu      BITNET: moore@utkvx