Path: utzoo!attcan!uunet!wuarchive!cs.utexas.edu!sun-barr!newstop!texsun!convex!convex.convex.com!thurlow
From: thurlow@convex.com (Robert Thurlow)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS writes and fsync().
Message-ID:
Date: 21 Oct 90 00:02:56 GMT
References: <1990Oct9.152612@objy.objy.com> <1990Oct14.082712.10811@objy.com> <1990Oct19.222754.17622@dg-rtp.dg.com>
Sender: usenet@convex.com
Lines: 55

In <1990Oct19.222754.17622@dg-rtp.dg.com> stukenborg@mavplus9.rtp.dg.com (Stephen Stukenborg) writes:

>After reading the postings back and forth on the issue, I still haven't
>seen anyone really hit the nail on the head on why nfs write operations
>are synchronous. The primary reason is to make the client tolerant of
>server crashes. If I'm an NFS client, I don't want the fact that the
>server has crashed to impact my "view" of the world. My system is still up
>and running. Why should I lose any data?
>
>As has already been described by Rob Thurlow, if I mark my client buffer cache
>block as "clean" when the server acks my write, then I'm counting on that
>data being on stable storage. If the server is merely going to ack the
>receipt of my write request, then I have to hold on to that buffer until
>close (or the janitor daemon) verifies that everything is on the server's
>disk. As Jeff Mogul pointed out, the primary difference between traditional
>unix file system behavior and NFS is that the NFS close operation writes
>all of a file's dirty buffers to disk. It is this "sync-on-close" behavior
>that really dictates the synchronous write policy. (And also provides a wimpy
>consistency-control policy.)

Here's your problem. There is no "open" operation, nor "close"
operation, in the NFS protocol. If you want to open a file, your
client does an NFS getattr operation to ensure there is indeed a file
by that name. If you want to close, you simply stop using that
filehandle. All close does is force over-the-wire writes on all VM
pages of the associated file.

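To make the single-ack point concrete, here is a toy sketch in Python.
It is not real NFS client or server code: ToyServer, ToyClient, and
blocks_lost_after_crash are names made up for illustration, and the
"stable" and "volatile" dictionaries merely stand in for the server's
disk and memory. It assumes the client drops a dirty block as soon as
its write is acked, and that close is nothing more than a loop of
writes.

```python
class ToyServer:
    """Hypothetical server: 'stable' survives a crash, 'volatile' does not."""

    def __init__(self, sync_writes=True):
        self.sync_writes = sync_writes
        self.stable = {}    # block number -> data that reached the disk
        self.volatile = {}  # block number -> data held only in memory

    def write(self, block, data):
        # The only data-carrying operation in this model. It returns one
        # ack; there is no later "I/O complete" message.
        if self.sync_writes:
            self.stable[block] = data
        else:
            self.volatile[block] = data
        return True

    def crash(self):
        # Everything not yet on stable storage is gone.
        self.volatile.clear()


class ToyClient:
    """Hypothetical client with a dirty-block cache."""

    def __init__(self, server):
        self.server = server
        self.dirty = {}  # block number -> data not yet acked by the server

    def write(self, block, data):
        # Cache the block locally; nothing goes over the wire yet.
        self.dirty[block] = data

    def close(self):
        # Nothing called "close" goes over the wire: just send a write
        # for every dirty block and drop each block once it is acked.
        for block, data in list(self.dirty.items()):
            if self.server.write(block, data):
                del self.dirty[block]


def blocks_lost_after_crash(sync_writes):
    """Return the blocks that neither the client nor the disk still holds."""
    server = ToyServer(sync_writes)
    client = ToyClient(server)
    client.write(0, b"aaaa")
    client.write(1, b"bbbb")
    client.close()
    server.crash()
    return sorted(b for b in (0, 1)
                  if b not in server.stable and b not in client.dirty)


if __name__ == "__main__":
    print("sync server, lost blocks: ", blocks_lost_after_crash(True))
    print("async server, lost blocks:", blocks_lost_after_crash(False))
```

In this model the synchronous server loses nothing, while the async
server loses both blocks: the client discarded them on the ack, and no
later message exists that could have told it to hang on.
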
In fact, the only way in the current protocol to send data is with the
write operation, and the only way to send metadata is with the set
attributes operation. There is no way to have a second ack when I/O
is complete. Now, maybe a future protocol will change this; that
writecache from NeFS looks like it has a lot of potential. But right
now, if the client doesn't feel free to discard the data when the
write ack is received, it will never get any information that makes it
feel better about doing so later.

>Do users really want a MIPS-like export option that says "don't do sync
>writes"? (Note that these async writes are different than those mentioned
>above. Now I'm talking about the possibility that data will be lost
>on a server crash.) The only reason I can think of having this feature
>is for truly wondrous benchmark results that you can wave in a
>customer's face.

Remember how hard a diskless node hits its NFS-mounted swap device -
I've read numbers akin to five writes to every read from /export/swap.
If I'm the sort of user who doesn't run long-lived batch jobs from my
workstation, I might enjoy the performance edge I gain with async I/O
without minding the cost of rebooting most or all of the time when my
boot server crashes.

#include
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."