Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!sdd.hp.com!elroy.jpl.nasa.gov!ncar!gatech!utkcs2!chili.cs.utk.edu!moore
From: moore@chili.cs.utk.edu (Keith Moore)
Newsgroups: comp.protocols.nfs
Subject: Is there a reliable way to write a file to an NFS server?
Message-ID: <1991Mar25.233245.15209@cs.utk.edu>
Date: 25 Mar 91 23:32:45 GMT
Sender: usenet@cs.utk.edu (USENET News Poster)
Reply-To: moore@cs.utk.edu
Organization: Univ. of Tenn. Computer Science, Knoxville
Lines: 80

Background:

I'm working on a way to distribute electronic mail delivery, in order
to make it more reliable. Currently we have about 100 machines of
various sizes that all mount /var/spool/mail from one place via NFS.
One consequence of this is that the mail server is a single point of
failure -- if it goes down, mail delivery stops for everyone. Even if
it's only down for an hour or two, this is annoying -- our users expect
e-mail to be as reliable as the telephone. (Not that they lose
messages; they are just annoyed that the messages aren't delivered
immediately.)

My current idea for a mail delivery scheme replaces the
/var/spool/mail/$user directory with a MESSAGES directory in each
user's home directory. One or more mail servers (each of which has an
Internet MX record pointing to it) will access a recipient's MESSAGES
directory via NFS, and write a uniquely-named file in that directory
for each message delivered. We will modify our mail user agents to
read files from this directory rather than from /var/spool/mail/$user.
(A similar scheme is used by CMU's Andrew Message Delivery System,
which is normally layered on top of the Andrew File System.)

There are several reasons for delivering the mail on the same file
system as the user's files. The most important, but least obvious,
reason is that the user's file system must be present anyway before
mail can be delivered, otherwise the user's .forward file might be
ignored.
Delivery to the user's file system thus minimizes the probability of
failure: if the user needs access to both his own files and his
incoming mail messages in order to work effectively, it's best if they
fail at the same time rather than independently (assuming the same
failure rate for both).

The Problem:

Mail delivery has to be absolutely reliable. We would like to avoid
delays when possible, and under no circumstances is it acceptable for
the mail system to lose a message during transfer or delivery.

Our NFS load is distributed over several file servers which are used
almost exclusively for NFS access. Under normal conditions, response
for NFS clients is quite good. But during periods of peak load,
response degrades to the point at which a single NFS file operation
may require on the order of a minute to complete.

Under these conditions, I have seen two instances within the last week
in which a file created on a server with the normal creat(), write(),
..., close() sequence ended up zero length with no indication of
error, even though the return values on all system calls were checked.
I have observed this both with "real" applications (like vi) and with
a simple C program I wrote. My suspicion is that the initial NFS
create RPC was duplicated while waiting for the server to respond, and
the server performed one (or more) of the duplicates after the file
was written. My workstation is running SunOS 4.1.1, and the server is
running 4.1, so these are fairly recent implementations of the
software.

Even though we could tune our NFS clients, servers, and workload to
avoid these problems to some degree (not that we haven't already done
this), this would only push back the knee of the curve, instead of
eliminating the problem entirely. Unfortunately, this is not good
enough. I realize that it will always be possible for the load on a
server to be so high that it cannot service any more requests, but I
would at least expect an error indication in this case.
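
For concreteness, the creat(), write(), ..., close() sequence described
above, with every return value checked, might look roughly like the
sketch below. The helper name write_checked is hypothetical. The
fsync() call and the final stat() size comparison are additions that
are not part of the sequence described above; they are assumptions
about what might surface the zero-length symptom, and a client's
cached attributes could still hide a truncation that happens later on
the server.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create path and write len bytes from buf.
 * Returns 0 if every call succeeded and the file has the expected
 * size afterwards, -1 otherwise. */
static int write_checked(const char *path, const char *buf, size_t len)
{
    struct stat st;
    size_t done = 0;
    int fd;

    assert(path != NULL && buf != NULL);

    fd = creat(path, 0600);
    if (fd < 0)
        return -1;

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0 && errno == EINTR)
            continue;               /* interrupted: retry this write */
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;          /* short write: keep going */
    }

    /* Assumption: force the data out before close() so that a write
     * error is reported here rather than lost. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    if (close(fd) < 0)
        return -1;

    /* Assumption: compare the size seen after close() with the number
     * of bytes written. This may be answered from cached attributes. */
    if (stat(path, &st) < 0)
        return -1;
    if (st.st_size != (off_t)len)
        return -1;
    return 0;
}
```

A delivery agent could call write_checked(path, msg, msg_len) and treat
a return of -1 as a temporary failure to be retried later.
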
My mail delivery agent already has to cope with various temporary
failures (like when the user's file server is down), and this works
fine.

So my question to this newsgroup is: Is there any way I can reliably
create a file on an NFS mounted directory, write its contents, close
it, and know whether the file was written correctly and completely?

The best idea I've had so far is to issue the creat() syscall, then
wait several seconds (time t) to make sure any duplicate NFS create
RPCs have been processed by the server, then to write out the file and
close it (checking return codes, of course). This might work if I can
come up with reasonable bounds for t. (How long can one of these
things stay in a server's input queue, anyway?)

--
Keith Moore / U.Tenn CS Dept / 107 Ayres Hall / Knoxville TN 37996-1301
Internet: moore@cs.utk.edu      BITNET: moore@utkvx
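
The create-then-wait idea in the post above could be sketched as
follows. The function name deliver_after_settle and the parameter
settle_secs (standing in for the time t) are hypothetical, and no
particular value of t is suggested here; whether any bound on t exists
is exactly the open question.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: create the file, wait settle_secs seconds so
 * that any duplicate create requests can drain, then write and close.
 * Returns 0 on success, -1 if any call reports an error. */
static int deliver_after_settle(const char *path, const char *buf,
                                size_t len, unsigned settle_secs)
{
    size_t done = 0;
    unsigned left = settle_secs;
    int fd;

    assert(path != NULL && buf != NULL);

    fd = creat(path, 0600);
    if (fd < 0)
        return -1;

    /* sleep() returns the unslept time if a signal interrupts it,
     * so loop until the whole interval has passed. */
    while (left > 0)
        left = sleep(left);

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0 && errno == EINTR)
            continue;               /* interrupted: retry this write */
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;          /* short write: keep going */
    }

    return close(fd) < 0 ? -1 : 0;
}
```

The wait holds up every delivery by settle_secs, so it trades latency
for whatever protection the delay actually provides.
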