Path: utzoo!attcan!uunet!zephyr.ens.tek.com!uw-beaver!cornell!rochester!rutgers!usc!elroy.jpl.nasa.gov!decwrl!sgi!llustig!objy!prefect!peter
From: peter@prefect.uucp (Peter Moore)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS writes and fsync().
Message-ID: <1990Oct14.082712.10811@objy.com>
Date: 14 Oct 90 08:27:12 GMT
References: <1990Oct9.152612@objy.objy.com>
Sender: news@objy.com
Organization: Objectivity Inc.
Lines: 123

> I think most people would agree that the default behaviour should be to
> make writes reliable, since that provides the semantics of a local
> filesystem.

But those aren't the semantics of a local write.  See below.

> Stop right there.  Your 'disk' has just lost data, period.  Do you
> expect your local disk to ever do that?

Yes, I do expect my local disk to do that.  As you sort of mention,
local writes under almost all Unix systems are asynchronous.  The write
returns immediately, but the data stays in the buffer pool until either
it is pushed out to make room for more active pages or sync() is
called.  Typically sync() is called every 30 seconds by the update
daemon.  So you have no guarantee that your last 30 seconds of local
I/O ever make it to disk unless you explicitly do a sync.  And if
something does go wrong (an unrecoverable bad block, a drive going off
line, or a full crash during the 30 second period), there is no way to
signal back to you that it failed.  Heck, your process could have
exited before the write failed.

> The effects could be very
> devastating, depending on what exactly cared about the data.  Think
> of the havoc you could wreak on a database server.

But, as I pointed out, this effect can happen on local writes too.
That is why any `database-like' application must explicitly call
fsync() if it wishes to guarantee that pages have made it to disk.  No
recoverable system can depend on write() alone when writing to local
disk.  So synchronous NFS isn't helpful to the database people, at
least for that reason.
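To make the write()/fsync() distinction concrete, here is a minimal
sketch in Python (chosen only for illustration; it is not from the
discussion).  The helper name durable_write is hypothetical.  os.write
and os.fsync are thin wrappers over the Unix calls of the same names.
The sketch assumes that an error surfacing from fsync() is the only
notification the program will ever get that buffered data was lost.

```python
import os


def durable_write(path, data):
    """Write data to path; return only after fsync() has succeeded.

    Hypothetical helper for illustration.  Raises OSError if either
    the write or the flush to stable storage fails.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # write() may accept fewer bytes than requested, and a
            # successful return only means the kernel buffered them.
            written = os.write(fd, view)
            view = view[written:]
        # fsync() blocks until the kernel has pushed this file's
        # buffered data out, and reports an error if that failed.
        os.fsync(fd)
    finally:
        os.close(fd)
```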
They are already doing the right thing with explicit syncs just to make
it work locally.

This is why synchronous NFS writes seem unmotivated to me.  They are
MORE synchronous than local Unix I/O (assuming that network latency is
a lot less than 30 seconds).  Why pay such a cost to be MORE
synchronous than we are already willing to live with on Unix?

The strongest justification I have been able to synthesize from various
sources is basically the original one in my article:

1) Process P on machine L writes to file F on machine R.
2) R crashes before syncing the changes to disk and either:
   a) recovers before P does any more writes to F, or
   b) crashes after P has finished with F.
3) P continues on blithely unaware that the writes to F failed and
   produces some data.

This scenario seems unique to NFS because most failures of the local
machine to write also involve the local machine crashing, so P couldn't
continue after the write really failed.

(Actually this is still possible with local writes: P could either exit
before the crash, or the write error could be a softer error, perhaps a
bad block, that didn't force the system to crash.  I think it is even
possible to have later I/Os make it to disk, but earlier I/Os not.  But
I do admit this scenario is more likely in the NFS case.)

I have some hand-wavey counter-arguments to the above scenario.

1) Only relatively naive programs will run into this.  Reliable
   programs will already be doing fsync at the proper times.  So as
   long as an asynchronous form of NFS implements fsync, they will be
   all right.
2) Only long-lived programs are vulnerable (at least more vulnerable
   than with local writes).  If a process takes much less than 30
   seconds to run, then it is very unlikely that the process will
   actually be killed by a machine crash that wipes out its buffered
   writes.
   So if the process worked with local writes, it must either be
   calling fsync, or already be willing to live with not knowing
   whether its I/O made it to disk.
3) This isn't all that likely.  For case 2a) you need R to crash and
   recover before P does any more I/O to F.  Considering that big
   servers I have worked with can take over half an hour to reboot,
   that is a very wide window to miss.  And for case 2b) R has to
   crash within 30 seconds of the very last I/O to F; no sooner, no
   later.
4) Kludges can be added.  If NFS handles were invalidated across
   reboots (perhaps by including a byte computed from the boot time in
   the handle), then at least 2a) would be impossible without an
   explicit reopen of F.  More complicated support from the local OS
   could probably even make reopens of F by P fail (though the vague
   implementations I can think of are unacceptably kludgey).

Now none of these arguments are overwhelming, but they do add up.  I am
not trying to argue that NO one needs or wants synchronous NFS.  I am
arguing that not everybody does (and I believe, but can't defend better
than the above, that MOST people don't need it).

> I'll add that we do provide an export option to allow you to tell the
> server to acknowledge the write request immediately upon receipt, and
> spool the request to its local I/O subsystem.  It can help performance
> a good bit if you don't mind the risks.  It's great for filesystems all
> clients mount with -soft; their processes will be gone after a server
> reboot, anyway.

This is exactly the sort of thing I want.  Now I just need it on all my
machines as an option.  Guy Harris mentioned in an email thread of this
conversation that an asynchronous extension of NFS has been considered.
This would seem the best path, allowing some protocol to negotiate
whether the NFS connection will be synced (`nfsmount -o async`
perhaps?).  This could be a big win for a lot of installations.
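The boot-time kludge in counter-argument 4 can be sketched as a toy
model.  This is Python used purely for illustration, not NFS code:
Server, FileHandle, StaleHandle, and the boot_tag field are all
hypothetical names, and the in-memory dictionary stands in for the
server's disk.  Note that with a single byte, two boots whose times
agree modulo 256 would produce the same tag, so a real scheme would
presumably want more bits.

```python
from dataclasses import dataclass


class StaleHandle(Exception):
    """Raised when a handle was issued before the server's last boot."""


@dataclass(frozen=True)
class FileHandle:
    inode: int
    boot_tag: int  # hypothetical: one byte derived from the boot time


class Server:
    def __init__(self, boot_time):
        # Keep only the low byte of the boot time, as the kludge suggests.
        self.boot_tag = int(boot_time) & 0xFF
        self.files = {}  # inode -> bytearray, a stand-in for the disk

    def lookup(self, inode):
        # Every handle handed out carries the tag of the current boot.
        return FileHandle(inode, self.boot_tag)

    def write(self, handle, data):
        # A handle from an earlier boot is refused, so the client
        # finds out that the server went down in between.
        if handle.boot_tag != self.boot_tag:
            raise StaleHandle("handle predates server reboot")
        self.files.setdefault(handle.inode, bytearray()).extend(data)
```

In this model, a process P holding a handle from before the crash gets
StaleHandle on its next write instead of silently continuing, which is
the property wanted for case 2a).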
(Maybe even linking a RISC application over NFS could finally take a
finite amount of time.)

	Peter Moore
	peter@objy.com

P.S. While I don't know if they even want to be associated with this
argument, I would like to thank Craig Everhart, Carl Smith, and Guy
Harris for having the patience to discuss much of the above with me in
email conversations.