Path: utzoo!attcan!uunet!lll-winken!sun-barr!newstop!sun!ennoyab.Eng.Sun.COM!beepy
From: beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS writes and fsync().
Summary: synchronous writes, stateless design
Message-ID: <143972@sun.Eng.Sun.COM>
Date: 20 Oct 90 02:23:35 GMT
References: <1990Oct9.152612@objy.objy.com> <1990Oct16.004225.22754@wrl.dec.com>
Sender: news@sun.Eng.Sun.COM
Lines: 231

[This is a long response describing some aspects of NFS behaviour
on writes. Slightly delayed because of news problems.]

In article <1990Oct16.004225.22754@wrl.dec.com>, mogul@wrl.dec.com (Jeffrey Mogul) writes:
> One of the problems with NFS is that it manages to tangle up several
> different issues, which makes it hard to solve one without breaking
> something else.

Perhaps because the issues are related. Sprite cache strategies consider
data consistency issues, as does AFS 4.0, because aggressive cache
coherency strategies must make concessions to data consistency
(cooperation amongst clients). I find it incredibly hard to disentangle
the two in any discussion, though I find it useful to separate
them in looking at problems in distributed file systems.

Synchronous Write Behaviour, NFS and Applications
-------------------------------------------------

> For example, the reason why NFS clients do write-throughs to the server
> is partly for reliability (the client could crash before a delayed
> write is sent to the server) and partly a consequence of the statelessness
> dogma. This is because if you have two clients sharing the same file,
> changes have to appear on the server "as soon as possible" in order
> to preserve some shreds of local-Unix-like cache consistency.

I believe you've confused me above. An application is still subject
to "sync" delays for writes to server via NFS exactly the same
as writing to local disk. This is the behaviour we've come to know
and love on UNIX.
NFS writes are triggered by normal buffer flushing (sync activity) on a
client. Your description above makes it sound like writing is
synchronous to the application. This is, generally, not the case.

As a colleague points out:

    In the normal case, I/O is handled asynchronously subject to
    the normal update syncs. I would not consider this synchronous
    to the application!

    There are cases in NFS when I/O is synchronous (to the
    application), however:

    - Any time an fsync is done by the application
    - For the remaining life of the file descriptor once a lock
      has been applied to it (even if all locks are then cleared)

Is there reason for you to believe it works otherwise than this?

The NFS client requirement that servers store data in stable storage
before replying NFS_OK is not a result of statelessness "dogma". It is a
result of stateless design. Further, the client assumption that
servers write data to stable storage (the semantics of a good reply)
is unrelated to the 30 second sync time consideration, which, as you
state, attempts to preserve some shred of local UNIX like data
consistency. However, the 30 second sync time more resembles local file
write behaviour in UNIX (subject to periodic syncs). I believe you
have confused several issues here.

Client Requirements for Server Writes to Stable Storage
-------------------------------------------------------

More critical to the on-going discussion are the reasons for an NFS
client requiring servers to write data (including "meta-data" like
file size) to stable storage before replying back to the client.

It is an inherent assumption in the current design of NFS that
servers will not respond NFS_OK (that is, "Write Successful")
until data from a client has been written to stable storage.
It is not partly a consequence of the "stateless dogma" but
inherently a consequence of the "stateless design" of NFS.
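
[A minimal sketch of the server-side rule just described, not taken
from any real NFS server. The handler name handle_write is
hypothetical, the status values are illustrative, and fsync() on a
local file stands in for "stable storage". It only shows the ordering:
flush first, reply NFS_OK second.]

```python
import os

# Status codes for the reply. The numeric values are illustrative
# assumptions for this sketch.
NFS_OK = 0
NFSERR_IO = 5


def handle_write(fd, offset, data):
    """Hypothetical server-side write handler.

    Returns NFS_OK only after the data (and meta-data such as file
    size) has been pushed to stable storage with fsync().
    """
    try:
        written = 0
        while written < len(data):
            # os.pwrite may write fewer bytes than requested, so loop.
            written += os.pwrite(fd, data[written:], offset + written)
        # Do not reply until the data has been flushed to the device.
        os.fsync(fd)
    except OSError:
        return NFSERR_IO
    return NFS_OK
```

[A server built this way sends the client whatever handle_write
returns, so a reply of NFS_OK is never sent for data that is still
only in volatile memory.]
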
The assumption that servers flush data to stable storage before
returning NFS_OK to the client has nothing to do with client crashes
but has everything to do with the implications of server crashes.

By requiring the server to write its data to stable storage, the
client need not concern itself with the current server state.
On receiving an NFS_OK from the server, the client is free to reuse
data buffers which held the data just written. If the server
crashes and returns (reboots), the client will (in classic "hard"
mounted situations) wait for the server to return and continue
where it left off. The server crash has not affected the operation
of clients.

This is some of the behaviour usually implied when people say
"NFS is stateless".

["stateless" is a relative term--we're obviously talking about
state on a client in the form of buffers held for 30 seconds. This
is normal "UNIX" buffering behaviour. There are other "stateless"
design implications, the other well-known one being the simple
cache coherency strategy used by NFS which results in checking
the attributes of a file to validate whether locally cached data
in the client is still valid--that is, in agreement with server data.
Another is the READDIR cookie, another is the encapsulation of
file location information in the file handle. The approach is to
keep the server simple, and burden the client with responsibility
for keeping critical state. This critical state is not shared
with other clients. Servers are also not without "state"--
servers typically employ a read-ahead strategy to improve performance--
however the key here is that such server state is not critical
to proper operation of NFS.]

The semantics of an NFS write are to preserve data in event of
a server crash (by requiring it to be on stable storage--static
RAM or disk).

Suggestions on just allowing servers to return NFS_OK without
flushing to stable storage [as have been made in preceding e-mails]
are in some sense dangerous.
Because all existing clients are implemented under the assumption that
NFS servers only reply okay if the data is "safe". {Assuming
you didn't just lose the server disk you wrote to during the
server crash.}

That the client flushes modified buffers every 30 seconds (or so)
is a client data reliability issue, in exactly the same way as
flushing buffers to local disk: it preserves data in the event of a
client crash. In this way, NFS is no different from local writing.
In addition, 30 second flushes preserve shreds of UNIX data
consistency amongst clients (as you mention above). [A useful side
effect].

> If NFS clients behaved like local-disk Unix systems (only write dirty
> blocks every 30 seconds), then it wouldn't matter as much if the server
> acknowledged them immediately, or waited until the data was safely on
> the disk. (As has been pointed out, it would be a trivial change to
> allow the client to distinguish "precious" blocks from others, just
> as the local-disk Unix file system has always done.) But, since server
> disk write latency is so nakedly exposed to client applications, anything
> that speeds that latency (such as a "stable-storage" cache, or faster
> disks) helps a lot.

NFS clients do behave like local-disk UNIX systems... What do you
mean above--this is where I remain confused? Server disk latency
is not exposed so nakedly to a client application. (See above
discussion).

To help applications detect errors in writes, asynchronous write
errors (asynchronous with respect to the execution of the write system
call by the application) are returned at close() time. This is why it
is so critical for an application to check the result of a "close"
operation to detect such errors.

I repeat again: NFS writes are not (in general) synchronous from the
client application viewpoint, only from the NFS client viewpoint.
----------------------------                --------------------

In effect, a file close() results in an fsync() of the file
to ensure that any asynchronous errors are seen by the application.

The current protocol has no provision for later acknowledgement of
data being on stable storage (asynchronous writing) that would allow
the client to implement a "precious block" policy. Such a change would
require a protocol revision.

What do you consider "precious"? The NFS design considers user data
precious, and ensures approximately the same guarantee of reliability
to an application that is provided by the local UNIX file system.
The semantics of "close" returning any asynchronous write errors
(in effect returning following the flush of data to stable storage
on the server) provide further guarantees to the application.
The attempt is to eliminate insidious silent errors.

Stable storage caching (static RAM techniques) on the server
accelerates client applications OVERALL because latency on NFS write
requests is reduced (as read-ahead techniques reduce latency by
eliminating synchronous disk access, so writing to static RAM reduces
latency by eliminating synchronous disk write activity). The key
point here is that no one particular application's write performance
is improved, but an NFS client's OVERALL performance is improved
(thereby improving all applications).

Future Directions
-----------------

> Of course, NFS isn't the last word in file systems. Anyone interested
> in a better design can read the papers on Sprite (e.g., Michael N. Nelson,
> Brent B. Welch, and John K. Ousterhout, "Caching in the Sprite Network File
> System", Trans. Computer Systems 6:1, pages 134-154, Feb. 1988) and Spritely
> NFS (V. Srinivasan and Jeffrey C. Mogul, "Spritely NFS: Experiments with
> Cache-Consistency Protocols", Proc. 12th SOSP, pages 45-57, Dec. 1988).
>
> But for many of us (including me!), NFS is what we use, so solutions that
> don't require protocol changes (such as server stable-storage boards) might
> still be a win.

NFS is what we use because it is a solution available commercially
today, while the papers you reference above describe research
in distributed file systems. I take possible exception to your
term "better" design--the NFS design met its goals, provides
a good solution, and works.

I believe that Sprite, AFS and Spritely NFS have shown a lot of
promise. In one form or another, they address the issues
of (1) cache coherency, and (2) data consistency. Compare AFS 3.0
and AFS 4.0 and you may arrive at the dichotomy on coherency
that helps me understand the differences twixt the two. (Or maybe
not.) AFS 4.0 definitely provides stronger data consistency semantics
(through the Token Manager) than AFS 3.0 (which had well-defined, but
possibly moot cache consistency since the guarantees for data
consistency amongst cooperating clients were--are--weak. See the
paper by Kazar and crew in Summer Usenix proceedings).

AFS, Sprite and Spritely NFS provide direction for us [the NFS
community] on ways to improve performance and data consistency
guarantees in future distributed file systems.

NFS improvements in data consistency (the view of data as seen
by multiple clients) are not addressed by stable-storage boards.
Stable storage boards provide a performance boost within the
framework of the current NFS protocol while preserving correctness
(the implicit agreement made between an NFS client and server
on write semantics).

I think it is time to consider alternative cache consistency
models for NFS, and research in the area provides several
directions.

HOWEVER, I also believe that the simplicity of the current design
of NFS, particularly in regards to data reliability, is not
something we should toss aside lightly.
NFS has been made available on a wide variety of platforms because it
has been both easy to port and fairly easy to implement from the
specification. Simplicity is not a bad word. Simple error recovery
semantics in a distributed application is not a bad design.
Complex error recovery techniques may accompany complex cache
coherency schemes.

There is a body of knowledge now on many of the issues. Perhaps
it is time to exploit this knowledge seriously in NFS.

Lest we lapse into a mode where we believe data and cache
consistency are the only issues, one should look around at others:
operation over unreliable networks (WANs), administration, support
for shared file name spaces, etc.

Feedback on issues is solicited,

> -Jeff

Brian Pawlowski