Path: utzoo!attcan!uunet!lll-winken!sun-barr!newstop!sun!ennoyab.Eng.Sun.COM!beepy
From: beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS writes and fsync().
Summary: synchronous writes, stateless design
Message-ID: <143972@sun.Eng.Sun.COM>
Date: 20 Oct 90 02:23:35 GMT
References: <1990Oct9.152612@objy.objy.com> <thurlow.655748135@convex.convex.com> <1990Oct16.004225.22754@wrl.dec.com>
Sender: news@sun.Eng.Sun.COM
Lines: 231

[This is a long response describing some aspects of NFS behaviour
on writes. Slightly delayed because of news problems.]

In article <1990Oct16.004225.22754@wrl.dec.com>,
mogul@wrl.dec.com (Jeffrey Mogul) writes:

> One of the problems with NFS is that it manages to tangle up several
> different issues, which makes it hard to solve one without breaking
> something else.

Perhaps because the issues are related. Sprite cache strategies
consider data consistency issues, as does AFS 4.0, because
aggressive cache coherency strategies must make concessions
to data consistency (cooperation amongst clients).

I find it incredibly hard to disentangle the two in any discussion,
though I find it useful to separate them in looking at problems
in distributed file systems.

Synchronous Write Behaviour, NFS and Applications
-------------------------------------------------

> For example, the reason why NFS clients do write-throughs to the server
> is partly for reliability (the client could crash before a delayed
> write is sent to the server) and partly a consequence of the statelessness
> dogma.  This is because if you have two clients sharing the same file,
> changes have to appear on the server "as soon as possible" in order
> to preserve some shreds of local-Unix-like cache consistency.

I believe you've confused me above. An application is still subject to
"sync" delays for writes to a server via NFS, exactly as it is when
writing to local disk. This is the behaviour we've come to know and love
on UNIX. NFS writes are triggered by normal buffer flushing (sync
activity) on a client. Your description above makes it sound as if
writing is synchronous to the application. This is, generally, not the
case.

As a colleague points out:

  In the normal case, I/O is handled asynchronously subject to
  the normal update syncs.  I would not consider this synchronous to the
  application! There are cases in NFS when I/O is synchronous (to the
  application), however:

	- Any time an fsync is done by the application
	- For the remaining life of the file descriptor once a lock
	  has been applied to it (even if all locks are then cleared)

Is there reason for you to believe it works otherwise than this?
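
A minimal sketch of the first case, in Python. The helper name
`write_durably` and the file name are my own illustrative assumptions,
not part of any NFS implementation. The write() calls normally just fill
client buffers and return; only the explicit fsync() makes the
application wait until the data has been pushed out.

```python
import os
import tempfile


def write_durably(path, data):
    """Write data, then block until the OS reports it flushed.

    Hypothetical helper for illustration only. On an NFS mount the
    fsync() is the point where the application waits on the server.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(data):
            # os.write may write fewer bytes than requested
            written += os.write(fd, data[written:])
        os.fsync(fd)  # synchronous to the application
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_durably(os.path.join(d, "example.dat"), b"hello\n"))
```

Dropping the fsync() call leaves the data subject to the normal update
syncs described above.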

The NFS client requirement that servers store data in stable storage
before replying NFS_OK is not a result of statelessness "dogma".
It is a result of stateless design.

Further, the client's assumption that servers write data to
stable storage (the semantics of a good reply) is unrelated
to the 30 second sync time consideration, which, as you state,
attempts to preserve some shred of local UNIX-like data consistency.
However, the 30 second sync time more closely resembles local file
write behaviour in UNIX (subject to periodic syncs).

I believe you have confused several issues here.

Client Requirements for Server Writes to Stable Storage
-------------------------------------------------------

More critical to the on-going discussion are the reasons an NFS
client requires servers to write data (including "meta-data" like
file size) to stable storage before replying back to the client.
It is an inherent assumption in the current design of NFS that servers
will not respond NFS_OK (that is, "Write Successful") until data from 
a client has been written to stable storage. It is not partly a
consequence of the "stateless dogma" but inherently a consequence of 
the "stateless design" of NFS.

The assumption that servers flush data to stable storage before returning
NFS_OK to the client has nothing to do with client crashes but has
everything to do with the implications of server crashes. By requiring
the server to write its data to stable storage, the client need not
concern itself with the current server state. On receiving an NFS_OK from
the server, the client is free to reuse the data buffers which held the
data just written. If the server crashes and returns (reboots), the client
will (in classic "hard" mounted situations) wait for the server to
return and continue where it left off. The server crash has not affected
the operation of clients. This is some of the behaviour usually implied
when people say "NFS is stateless".
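
The contract can be sketched as a toy simulation in Python. Everything
here (`ToyServer`, `ToyClient`, the `crash_next` hook) is a hypothetical
model for illustration, not the NFS protocol or any real implementation.
The server commits to stable storage before replying, and the client
holds its buffer and resends until it sees NFS_OK.

```python
class ServerCrashed(Exception):
    """Raised when the toy server loses a request (no reply sent)."""


class ToyServer:
    """Hypothetical server: replies NFS_OK only after data is stable."""

    def __init__(self):
        self.stable = {}         # survives a crash (disk or static RAM)
        self.crash_next = False  # test hook: lose the next request

    def write(self, offset, data):
        if self.crash_next:
            self.crash_next = False  # "reboot"; nothing was committed
            raise ServerCrashed()
        self.stable[offset] = data   # commit before replying
        return "NFS_OK"


class ToyClient:
    """Keeps each buffer until the server acknowledges the write."""

    def __init__(self, server):
        self.server = server
        self.pending = {}  # offset -> data not yet acknowledged

    def write(self, offset, data):
        self.pending[offset] = data
        while True:  # "hard" mount: keep retrying until answered
            try:
                reply = self.server.write(offset, data)
            except ServerCrashed:
                continue  # a real client would back off, then resend
            if reply == "NFS_OK":
                del self.pending[offset]  # buffer may now be reused
                return reply
```

Because the client never discards a buffer before the acknowledgement,
the crash in the middle costs a delay and nothing else.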

["stateless" is a relative term--we're obviously talking about state
on a client in the form of buffers held for 30 seconds.  This is normal
"UNIX" buffering behaviour. There are other "stateless" design implications,
the other well-known one being the simple cache coherency strategy used
by NFS which results in checking the attributes of a file to validate
whether locally cached data in the client is still valid--that is,
in agreement with server data. Another is the READDIR cookie, another
is the encapsulation of file location information in the file handle.
The approach is to keep the server simple and to burden the client
with the responsibility of keeping critical state. This critical state
is not shared with other clients. Servers are also not without "state"--
servers typically employ a read-ahead strategy to improve performance--
however the key here is that such server state is not critical to proper
operation of NFS.]
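
A rough Python sketch of the attribute check mentioned in the note
above. The classes and the integer "mtime" are assumptions made for
illustration. As a simplification, this sketch asks for attributes on
every read.

```python
class ToyFileServer:
    """Hypothetical server holding one file and its modify time."""

    def __init__(self, data):
        self.data = data
        self.mtime = 1

    def getattr(self):
        return {"mtime": self.mtime, "size": len(self.data)}

    def read(self):
        return self.data

    def write(self, data):
        self.data = data
        self.mtime += 1  # any change bumps the modify time


class CachingClient:
    """Validates its cached copy by comparing file attributes."""

    def __init__(self, server):
        self.server = server
        self.cached_data = None
        self.cached_mtime = None

    def read(self):
        attrs = self.server.getattr()
        stale = attrs["mtime"] != self.cached_mtime
        if self.cached_data is None or stale:
            # cache is empty or disagrees with the server: refetch
            self.cached_data = self.server.read()
            self.cached_mtime = attrs["mtime"]
        return self.cached_data
```

Note that the server keeps no record of which clients hold cached
copies; the client alone decides when its copy is stale.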

The semantics of an NFS write are to preserve data in the event of a
server crash (by requiring it to be on stable storage--static RAM or
disk).

Suggestions to simply allow servers to return NFS_OK without flushing
to stable storage [as have been made in preceding e-mails]
are in some sense dangerous, because all existing
clients are implemented under the assumption that NFS servers only
reply okay if the data is "safe". {Assuming you didn't just lose
the server disk you wrote to during the server crash.}

Flushing modified buffers every 30 seconds (or so) is a client data
reliability issue, exactly as it is when flushing buffers to local
disk: it preserves data in the event of a client crash. In this way,
NFS is no different from local writing.

In addition, 30 second flushes preserve shreds of UNIX data consistency
amongst clients (as you mention above). [A useful side effect].
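
The delayed-write behaviour can be modelled with a toy buffer cache in
Python. `DelayedWriteCache`, `tick` and the `backend_write` callback are
hypothetical names for this sketch; the 30 second default mirrors the
interval discussed here and is not taken from any particular kernel.

```python
class DelayedWriteCache:
    """Toy client buffer cache: writes are buffered, flushed later."""

    def __init__(self, backend_write, flush_interval=30.0):
        self.backend_write = backend_write  # callable(block_no, data)
        self.flush_interval = flush_interval
        self.dirty = {}                     # block_no -> data
        self.last_flush = 0.0

    def write(self, block_no, data):
        # Returns at once; the application does not wait on the backend.
        self.dirty[block_no] = data

    def tick(self, now):
        """Called periodically with the current time in seconds."""
        if now - self.last_flush >= self.flush_interval:
            self.sync()
            self.last_flush = now

    def sync(self):
        # Push every dirty block to the backend (local disk or server).
        for block_no in sorted(self.dirty):
            self.backend_write(block_no, self.dirty[block_no])
        self.dirty.clear()
```

The same cache works whether `backend_write` goes to a local disk or to
an NFS server, which is the sense in which NFS is no different from
local writing.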

> If NFS clients behaved like local-disk Unix systems (only write dirty
> blocks every 30 seconds), then it wouldn't matter as much if the server
> acknowledged them immediately, or waited until the data was safely on
> the disk.  (As has been pointed out, it would be a trivial change to
> allow the client to distinguish "precious" blocks from others, just
> as the local-disk Unix file system has always done.)  But, since server
> disk write latency is so nakedly exposed to client applications, anything
> that speeds that latency (such as a "stable-storage" cache, or faster
> disks) helps a lot.

NFS clients do behave like local-disk UNIX systems... What do you mean
above? This is where I remain confused. Server disk latency is
not exposed so nakedly to a client application. (See above discussion).

To help applications detect write errors, errors that occur
asynchronously (after the application's write system call has
returned) are reported at close() time. This is why it is so critical
for an application to check the result of a "close" operation to detect
such errors. I repeat: NFS writes are not (in general) synchronous
from the client application viewpoint, only from the NFS client viewpoint.
         ----------------------------                --------------------

In effect, a file close() results in an fsync() of the file to
ensure that any asynchronous errors are seen by the application.
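
What "checking the result of close" looks like from an application, as
a small Python sketch. The helper `save` is a hypothetical name; in
Python a failing close() shows up as an OSError from os.close(), where
a C program would test the return value instead.

```python
import os
import sys
import tempfile


def save(path, data):
    """Return True only if open, write and close all succeeded.

    Hypothetical helper for illustration only.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as e:
        print("open failed: %s" % e, file=sys.stderr)
        return False
    ok = True
    try:
        done = 0
        while done < len(data):
            done += os.write(fd, data[done:])  # may only buffer the data
    except OSError as e:
        print("write failed: %s" % e, file=sys.stderr)
        ok = False
    try:
        os.close(fd)  # deferred write errors can surface here
    except OSError as e:
        print("close failed: %s" % e, file=sys.stderr)
        ok = False
    return ok


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(save(os.path.join(d, "out.dat"), b"some data\n"))
```

An application that ignores the close() result can believe its data is
safe when an earlier buffered write in fact failed.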

The current protocol has no provision for later acknowledgement
of data being on stable storage (asynchronous writing), which is what
would allow the client to implement a "precious block" policy. Such a
change would require a protocol revision.

What do you consider "precious"? The NFS design considers user data
precious, and ensures approximately the same guarantee of reliability
to an application that is provided by the local UNIX file system.
The semantics of "close" returning any asynchronous write errors
(in effect returning following the flush of data to stable storage
on the server) provide further guarantees to the application.

The attempt is to eliminate insidious silent errors.

Stable storage caching (static RAM techniques) on the server accelerates
client applications OVERALL because latency on NFS write requests
is reduced (as read-ahead techniques reduce latency by eliminating
synchronous disk access, so writing to static RAM reduces latency
by eliminating synchronous disk write activity). The key point
here is that no one particular application's write performance
is improved, but an NFS client's OVERALL performance is improved
(thereby improving all applications).

Future Directions
-----------------

> Of course, NFS isn't the last word in file systems.  Anyone interested
> in a better design can read the papers on Sprite (e.g., Michael N. Nelson,
> Brent B. Welch, and John K. Ousterhout, "Caching in the Sprite Network File
> System", Trans. Computer Systems 6:1, pages 134-154, Feb. 1988) and Spritely
> NFS (V. Srinivasan and Jeffrey C. Mogul, "Spritely NFS: Experiments with
> Cache-Consistency Protocols", Proc. 12th SOSP, pages 45-57, Dec. 1988).
> 
> But for many of us (including me!), NFS is what we use, so solutions that
> don't require protocol changes (such as server stable-storage boards) might
> still be a win.

NFS is what we use because it is a solution available commercially
today, while the papers you reference above describe research in 
distributed file systems. I take possible exception to your term
"better" design--the NFS design met its goals, provides a good
solution, and works.

I believe that Sprite, AFS and Spritely NFS have shown a lot of promise.
In one form or another, they address the issues of (1) cache coherency,
and (2) data consistency. Compare AFS 3.0 and AFS 4.0 and you may arrive
at the dichotomy on coherency that helps me understand the differences
twixt the two. (Or maybe not.) AFS 4.0 definitely provides stronger
data consistency semantics (through the Token Manager) than AFS 3.0
(which had well-defined, but possibly moot, cache consistency, since
the guarantees for data consistency amongst cooperating clients
were--are--weak. See the paper by Kazar and crew in the Summer Usenix
proceedings).

AFS, Sprite and Spritely NFS provide direction for us [the NFS community]
on ways to improve performance and data consistency guarantees in
future distributed file systems. NFS improvements in data consistency
(the view of data as seen by multiple clients) are not addressed by 
stable-storage boards. Stable storage boards provide a performance
boost within the framework of the current NFS protocol while preserving 
correctness (the implicit agreement made between an NFS client and server
on write semantics).

I think it is time to consider alternative cache consistency models
for NFS, and research in the area provides several directions. HOWEVER,
I also believe that the simplicity of the current design of NFS,
particularly with regard to data reliability, is not something we should
toss aside lightly. NFS has been made available on a wide
variety of platforms because it has been both easy to port
and fairly easy to implement from the specification.

Simplicity is not a bad word. Simple error recovery semantics in
a distributed application are not a bad design. Complex error recovery
techniques may accompany complex cache coherency schemes. There
is a body of knowledge now on many of the issues. Perhaps it is
time to exploit this knowledge seriously in NFS.

Lest we lapse into a mode where we believe data and cache consistency
are the only issues, one should look around at others: operation
over unreliable networks (WANs), administration, support for shared
file name spaces, etc.

Feedback on these issues is solicited.

> -Jeff

Brian Pawlowski