Path: utzoo!attcan!uunet!lll-winken!uwm.edu!wuarchive!julius.cs.uiuc.edu!apple!decwrl!sgi!vjs@rhyolite.wpd.sgi.com
From: vjs@rhyolite.wpd.sgi.com (Vernon Schryver)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS writes and fsync().
Message-ID: <72781@sgi.sgi.com>
Date: 21 Oct 90 00:16:18 GMT
References: <1990Oct9.152612@objy.objy.com> <thurlow.655748135@convex.convex.com> <143972@sun.Eng.Sun.COM>
Sender: guest@sgi.sgi.com
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 158

In article <143972@sun.Eng.Sun.COM>, beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski) writes:
> ...
> More critical to the on-going discussion is the reasons for an NFS
> client requiring servers to write data (including "meta-data" like
> file size) to stable storage before replying back to the client.
> It is an inherent assumption in the current design of NFS that servers
> will not respond NFS_OK (that is, "Write Successful") until data from 
> a client has been written to stable storage. It is not partly a
> consequence of the "stateless dogma" but inherently a consequence of 
> the "stateless design" of NFS.

Confounding statelessness, to the limited degree it is an attribute of NFS,
with server caching policies is bad.  Consider that "state" is the purpose
of a file system.

NFS is not now and never was stateless.  It is relatively stateless, in
that the server is not notified of open()'s, unlike several other remote
file systems developed then and since.  AT&T made a big deal that NFS was
"stateless and so bad," while Sun responded that NFS was "stateless and so
good."  It was blarney in the battle between what AT&T called the "emerging
network file system standard" (RFS) and NFS.  The battle was not just the
public one, but the internal one between Sun engineering and whatever you
call the AT&T New Jersey UNIX department.  (I worked on the first SVR3 NFS
port in '85 in Mtn.View and saw some of the smoke of the cannons.)

"XID cache" is vital for making NFS come even as close as it does to real
UNIX file system semantics, and is by itself a sufficient counter to the
old claim that "NFS is stateless."
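
To make the point concrete, here is a minimal sketch of a duplicate-request
cache keyed by client and XID. It is a toy in-memory model, not code from any
real NFS server; the class, method, and file names are all hypothetical. It
shows what happens when the reply to a REMOVE is lost and the client
retransmits: without the cache the retry sees NFSERR_NOENT, and with it the
server replays the original NFS_OK.

```python
# Toy duplicate-request ("XID") cache. All names here are hypothetical.
from collections import OrderedDict


class ToyServer:
    def __init__(self, cache_size=128):
        self.files = {"a.txt": b"data"}
        self.replies = OrderedDict()   # (client, xid) -> reply already sent
        self.cache_size = cache_size

    def _remove(self, name):
        # REMOVE is not idempotent: a second execution fails.
        if name not in self.files:
            return "NFSERR_NOENT"
        del self.files[name]
        return "NFS_OK"

    def handle_remove(self, client, xid, name, use_cache=True):
        key = (client, xid)
        if use_cache and key in self.replies:
            return self.replies[key]   # retransmission: replay the old reply
        reply = self._remove(name)
        if use_cache:
            self.replies[key] = reply
            if len(self.replies) > self.cache_size:
                self.replies.popitem(last=False)   # evict the oldest entry
        return reply


def retransmitted_remove(use_cache):
    server = ToyServer()
    # First request succeeds, but assume its reply is lost on the wire.
    server.handle_remove("client1", 42, "a.txt", use_cache)
    # The client times out and retransmits with the same XID.
    return server.handle_remove("client1", 42, "a.txt", use_cache)
```

The cache is state held on the server on behalf of particular clients, and
the client sees different results depending on whether that state survived.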

> ...
> The assumption that servers flush data to stable storage before returning
> NFS_OK to the client has nothing to do with client crashes () but has
> everything to do with the implications of server crashes. By requiring
> the server to write its data to stable storage, the client need not
> concern itself with the current server state. On receiving an NFS_OK from
> the server, the client is free to reuse data buffers which held the data just
> written. If the server crashes and returns (reboots), the client
> will (in classic "hard" mounted situations) wait for the server to
> return and continue where it left off. The server crash has not affected
> the operation of clients. This is some of the behaviour usually implied
> when people say "NFS is stateless".

No, the phrase "NFS is stateless" has been almost devoid of meaning for
years, because it is confounded with the general notion of state, as in
your paragraph above.

> ["stateless" is a relative term--we're obviously talking about state
> on a client in the form of buffers held for 30 seconds.  This is normal
> "UNIX" buffering behaviour. There are other "stateless" design implications,
> the other well-known one being the simple cache coherency strategy used
> by NFS which results in checking the attributes of a file to validate
> whether locally cached data in the client is still valid--that is,
> in agreement with server data.

What has this to do with "statelessness"?  Please say what this "stateless"
has to do with the differences between the NFS cache coherence mechanism
and the coherency mechanisms in the distributed cache systems for files,
RAM, host names, and toaster temperatures.

>          ...                      Servers are also not without "state"--
> servers typically employ a read-ahead strategy to improve performance--
> however the key here is that such server state is not critical to proper
> operation of NFS.]

Wrong.  Without a proper XID cache, an NFS filesystem is an unacceptably
poor imitation of a UNIX filesystem.  Remember the problems at the
Connectathon before last.

Please understand that I like NFS very much and stuff many megabytes thru
NFS filesystems everyday.  I think the trade-offs of Bob Lyon &co. were
great and continue to be close to optimal.  Honesty conflicts with claims
that NFS==UFS.  There are many common UNIX behaviors where NFS is a poor
imitation of a Real Filesystem(tm).

> The semantics of an NFS write are to preserve data in event of a server
> crash (by requiring it to be on stable storage--static RAM or disk).
> ...

> Suggestions on just allowing servers to return NFS_OK without flushing
> to stable storage [as have been made in preceding e-mails]
> are in some sense dangerous. Because all existing
> clients are implemented under the assumption that NFS servers only
> reply okay if the data is "safe". {Assuming you didn't just lose
> the server disk you wrote to during the server crash.}

Exactly.  Life is "dangerous" and filled with disk crashes.

> ...
> The semantics of "close" returning any asynchronous write errors
> (in effect returning following the flush of data to stable storage
> on the server) provide further guarantees to the application.
> ...
> The attempt is to eliminate insidious silent errors.
 
I understand guarantees as absolute, except where explicitly limited.  The
Federal Government and the State of Calif agree with me.  If something is
guaranteed to not lose data, then it better not.  The NFS server dogma does
not provide a valid guarantee of preserving data, or of no silent errors.
It only improves the likelihoods.  This is because there is no such thing
as absolutely stable storage.  (As I write this, I'm restoring a crashed
disk.)

In most UNIX systems, the server cache in DRAM is lost during a crash, disk
sectors are usually not lost, and there is no third medium.  There are
other possibilities.  In the 1960's I worked with "mainframes" (Kronos on
6000's) where you could push the reset button ("level 3 restart"), and not
only have all active jobs resume, but where the contents of the RAM disk
caches would be recovered.  Amdahl, Unisys, CDC, and IBM probably still
have such features.  There are also systems where there are more than 2
layers of storage.  Where would the NFS server dogma require that a system
with "permanent" optical storage (whether modern WORM or ancient
microfiche), behind slow disks, behind fast drums, behind bulk RAM, behind
fast DRAM, behind SRAM cache preserve client data?  On the most stable,
even if it takes minutes to write?

> Stable storage caching (static RAM techniques) on the server accelerate
> client applications OVERALL because latency on NFS write requests
> are reduced (as read-ahead techniques reduce latency by eliminating
> synchronous disk access, so writing to Static RAM reduces latency
> by eliminating synchronous disk write activity). The key point
> here is that no one particular application's write performance
> is improved, but an OVERALL NFS client's performance is improved
> (thereby improving all applications).
> ...

This is a strange statement.  We found years ago that violating the NFS
cache dogma improved the numbers on many NFS benchmarks, from the Sun test
suite to many other benchmarks by 50%.  (Yes, fellow Connectathon
attendees, that is one of our secrets, now disclosed in an /etc/exports
option.)


It would be less dogmatic to say that when a server returns NFS_OK, it is
saying that the MTBF of the place containing the client's data is greater
than XXX, where the MTBF includes all possibilities of failure from power
to earthquake to kernel bug.  
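
As a rough illustration of that MTBF framing, the chance of losing data that
sits in a given place for a given time can be estimated from the MTBF of that
place. The sketch below assumes failures arrive at a constant rate (an
exponential model), which is a simplification of mine and not something the
protocol says. The function name and the sample figures are hypothetical.

```python
import math


def loss_probability(exposure_seconds, mtbf_seconds):
    """Chance of at least one failure while data is exposed.

    Assumes exponentially distributed failures, i.e. a constant
    failure rate of 1/MTBF. This is a modelling assumption only.
    """
    return 1.0 - math.exp(-exposure_seconds / mtbf_seconds)


# Hypothetical figures: data held 30 seconds in a cache whose MTBF,
# counting every cause of failure, is 30 days.
DAY = 86400
example = loss_probability(30, 30 * DAY)   # roughly 1 chance in 86,400
```

Under this model the question for any layer of storage is only whether the
resulting number is small enough for the customer, not which medium it is.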

The NFS protocol should dictate the external characteristics of the server
file system, not its internal implementation.  Whether the server flushes
to disk is an internal implementation issue.
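
A minimal sketch of that distinction, using an ordinary local file in place
of a server file system. The handler and its flush_before_reply parameter are
hypothetical names invented for this example; they are not any real server's
interface or export option. The point is only that the reply the client sees
is identical under either policy.

```python
import os
import tempfile


def handle_write(fd, offset, data, flush_before_reply):
    """Apply one WRITE request and return the status sent to the client."""
    os.lseek(fd, offset, os.SEEK_SET)
    os.write(fd, data)
    if flush_before_reply:
        os.fsync(fd)      # push the data to the device before answering
    # Otherwise the data stays in the server's buffer cache for now.
    return "NFS_OK"       # the reply is the same under either policy


def demo():
    fd, path = tempfile.mkstemp()
    try:
        first = handle_write(fd, 0, b"hello ", flush_before_reply=True)
        second = handle_write(fd, 6, b"world", flush_before_reply=False)
        os.lseek(fd, 0, os.SEEK_SET)
        return first, second, os.read(fd, 64)
    finally:
        os.close(fd)
        os.remove(path)
```

What differs between the two calls is only how likely the data is to survive
a crash in the interval before the cache is written back.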

Rational customers buy solutions to problems.  They don't care about
violations of dogma.  They only want an appropriate engineering solution to
preserving their data.  They don't care whether server buffers are flushed
to disk.  They care only that data are sufficiently rarely lost.

I was not present when the NFS cache dogma was graven in stone, but I
wonder if it was not mostly a statement about the lack of reliability of
NFS servers of the time (i.e. 68010 UNIX systems in 1984).

The NFS cache dogma does solve problems, but those problems are of people
selling things, not of people building or buying things.


Vernon Schryver
Silicon Graphics
vjs@sgi.com