Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!cs.utexas.edu!sun-barr!newstop!sun!exodus!terra.Eng.Sun.COM!beepy
From: beepy@terra.Eng.Sun.COM (Brian Pawlowski)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS performance
Summary: expected failures, synchronous writes, do you feel lucky
Message-ID: <15277@exodus.Eng.Sun.COM>
Date: 15 Jun 91 14:38:15 GMT
References: <1991Jun13.164017.29944@Firewall.Nielsen.Com> <1991Jun13.234448.16172@Firewall.Nielsen.Com>
Sender: news@exodus.Eng.Sun.COM
Lines: 137

[I lost the original article I was responding to so am replying to
a followup]

In article <1991Jun13.234448.16172@Firewall.Nielsen.Com>,
kdenning@genesis.Naitc.Com (Karl Denninger) writes:

> The MIPS systems I've used don't suffer from this problem.

Then they probably suffer from other problems:-) That they don't
have an export option indicating "safe" and "unsafe" raises a
question...

> I don't quite understand the fanatacism with which people preach the NFS
> stateless nature, O_SYNC and all that. The fact is that a crash of a
> LOCAL Unix machine with the normal block buffering scheme can easily cause
> the loss of data -- in this case, the write(2) call returned "ok" but it
> really might not be "OK"! This is true whether the problem is later found
> to be a bad disk sector, the machine panicing, or any one of a number of
> other causes. Normal disk I/O on Unix machines is NOT reliable enough to
> say "if you get a good return from write(), the data is safely on disk".

Good analysis for local operation. I would argue (below) that
distributed operation is a little different (particularly in regards
to assumptions and expected behaviour during failures of nodes
involved as compared to assumptions made when a local component
fails).

> If you WANT reliable I/O, you open with O_SYNC and take the performance hit.
> Why wasn't this option designed into NFS?
> It could have been set up so that
> for Non-Unix clients (which expect reliable I/O and don't have a "buffer
> cache" that can be disabled on a file I/O basis) default to O_SYNC mode...
> this is easily handled by making the Unix "open()" hook set the "no sync"
> flag...
>
> Or was this a short-cut that has just never been repaired?

The "problem" with a distributed file system, as typified by NFS,
is that modifying operations held in the server's buffer cache could
be lost due to a crash/failure without the client *ever* being aware
of it, if the updates are made asynchronously (some time in the
future) to the persistent store (disk) following successful
acknowledgement to the client.

NFS simplifies the (likely) failure modes possible by adding to each
modifying operation (write, create, etc.) the semantic that the
operation has been applied to persistent store before it is
acknowledged.

[I believe several studies on the reliability of hosts on the
Internet point to non-catastrophic failures--SW failures:-)--as a
primary cause of crashes. This semantic for NFS addresses this
nicely. While having access to some number N of hosts increases
availability of data to a given node, it also offers that many more
opportunities for an unexpected component failure (server crash) to
lose your data. NFS reduces its "critical state" assumptions,
simplifies client-server operation, and I believe increases
reliability through the requirement for flush-to-persistent
storage.]

I think that the analogy to a "local I/O" situation is flawed. A
user gets immediate notification of an "OS crash" in the local case
because his application crashes too. He has *little expectation*
that all his data is safe and will probably take some action to
investigate the situation. In the distributed case, things get
fuzzier.
Assuming for a moment that a given vendor implements buffered writes
on an NFS server to increase performance (tossing the synchronous
modify semantic), you have now introduced an interesting error
class: silent data loss.

The scenario introduced is that the server can acknowledge a final
write by an application while still holding several buffered data
blocks for that client queued to write to disk. The server returns
"OK", the client application is happy and exits. The server crashes
before it is able to flush the data. Blissfully unaware, the client
(and user) continue working on other files on other servers, and
return to the server in question sometime later after it has
rebooted--and lost data the user believes was written to disk on
the server.

I contend that the user expects the data to be on disk because
*he knows* his machine has been running well beyond the time it
takes to flush data to disk, from the client's perspective. To find
that the data is *lost* some time in the future without having had
an intervening client crash introduces an insidious error and, I
(further) contend, violates a basic transparency property provided
by NFS (of making remote files seem like local files--to a great
degree).

NFS does not provide "exact" local file system semantics for UNIX.
The original design paper describes the decisions made in providing
semantics, and the trade-offs made to simplify implementation and
reduce the complexity of error recovery.

One could envision a production DFS which buffers data on a server
in volatile storage for increased performance. I believe most
current (research) systems which do so take a rather cavalier
attitude towards ensuring integrity of modified data on behalf of
users. I would propose that you would want to introduce recovery
mechanisms to allow a client to resubmit data lost due to a server
crash--this introduces complex recovery scenarios to a DFS, and was
left out of NFS in the original design.
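
[The silent data loss scenario described above can be sketched as a
toy model. This is only an illustration, not NFS code: ToyServer,
sync_writes, crash_and_reboot and run_scenario are hypothetical
names, and the model assumes nothing more than a server with a
volatile cache, a persistent store, and a write call that returns
"OK" in both modes.]

```python
# Toy model only. ToyServer, sync_writes, crash_and_reboot and
# run_scenario are hypothetical names for illustration; this is not
# how any real NFS server is implemented.

class ToyServer:
    def __init__(self, sync_writes):
        self.sync_writes = sync_writes
        self.disk = {}    # persistent store: survives a crash
        self.cache = {}   # volatile buffer cache: lost on a crash

    def write(self, name, offset, data):
        """Accept a client write. The reply is "OK" in both modes."""
        if self.sync_writes:
            # Synchronous semantic: on disk before the reply is sent.
            self.disk[(name, offset)] = data
        else:
            # Buffered semantic: acknowledged while still volatile.
            self.cache[(name, offset)] = data
        return "OK"

    def crash_and_reboot(self):
        """A software crash: whatever was only in the cache is gone."""
        self.cache.clear()

    def read(self, name, offset):
        key = (name, offset)
        if key in self.cache:
            return self.cache[key]
        return self.disk.get(key)  # None means the block does not exist


def run_scenario(sync_writes):
    """Client writes three blocks, exits happy; server then crashes."""
    server = ToyServer(sync_writes)
    acks = [server.write("report", i, f"block{i}") for i in range(3)]
    # The application has seen only "OK" replies and exits here.
    server.crash_and_reboot()
    # Some time later the user comes back and looks for the data.
    survivors = [server.read("report", i) for i in range(3)]
    return acks, survivors


if __name__ == "__main__":
    for mode in (False, True):
        acks, survivors = run_scenario(sync_writes=mode)
        print("sync_writes =", mode, "acks:", acks, "after reboot:", survivors)
```

[In both runs the client sees the same three "OK" replies. Only in
the buffered run does the data turn out to be missing after the
reboot, and nothing the client saw could have told it so.]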

[Asynchronous writes after a fashion have been proposed for a
protocol revision of NFS... Some time in the hazy future.]

Comments on Write Performance for NFS:

From a client's perspective, NFS write performance is not as bad as
the above discussion might suggest. A client OS *still* does
read-ahead and write-behind for application I/O when talking to an
NFS server, through the use of BIODs. The close() system call
semantic was extended to include a synchronous flush of all dirty
pages when you close a file, which ensures that any errors in
flushing modified data to a server will be made available to the
application.

[The addition of the flush-on-close semantic to support
asynchronous error return for NFS was a design trade-off vs.
*totally* synchronous writes from the application perspective.]

I believe this trade-off gets close to expected local file
behaviour and eliminates silent data loss. [For the expected likely
failures--SW crashes. Of course a hard disk crash burns
everyone--but I believe this is *much less* expected.]

This is not to say that write performance for NFS is outstanding:-)

I am a proponent of improving write performance beyond current NFS
levels. One immediate attack is to install a Presto board (Sun and
DEC have this. Others?). Hell, it will even accelerate your local
synchronous modifying operations (like mkdir, etc). Another attack
is to use a product like eNFS for accelerating large file writes.

All improvements in this area (as the above solutions do) should
recognize that distribution inherently introduces different (more
interesting) failure modes, and that I for one (and I believe
others) don't appreciate an implementation of a distributed file
system which provides me with the wonderful possibilities of silent
loss of critical data.

> Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
> kdenning@nis.naitc.com

Brian Pawlowski last time I looked