Path: utzoo!attcan!uunet!cs.utexas.edu!sun-barr!newstop!sun!ennoyab.Eng.Sun.COM!beepy
From: beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS writes and fsync().
Summary: stateless dogma, write, synchronous
Message-ID: <143983@sun.Eng.Sun.COM>
Date: 21 Oct 90 05:34:08 GMT
References: <1990Oct9.152612@objy.objy.com> <thurlow.655748135@convex.convex.com> <72781@sgi.sgi.com>
Sender: news@sun.Eng.Sun.COM
Lines: 282

[On several recommendations, I'll try to keep the verbiage down. Whoops,
total failure.]

In article <72781@sgi.sgi.com>,
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:

> >                                              It is not partly a
> > consequence of the "stateless dogma" but inherently a consequence of 
> > the "stateless design" of NFS.
> 
> Confounding statelessness, to the limited degree it is an attribute of NFS,
> with server caching policies is bad.  Consider that "state" is the purpose
> of a file system.
> 
> NFS is not now and never was stateless.  It is relatively stateless, in
> that the server is not notified of open()'s, unlike several other remote
> file systems developed then and since.  AT&T made a big deal that NFS was
> "stateless and so bad," while Sun responded that NFS was "stateless and so
> good."  It was blarney in the battle between what AT&T called the "emerging
> network file system standard" (RFS) and NFS.  The battle was not just the
> public one, but the internal one between Sun engineering and whatever you
> call the AT&T New Jersey UNIX department.  (I worked on the first SVR3 NFS
> port in '85 in Mtn.View and saw some of the smoke of the cannons.)

I would not disagree with this. I was simplifying the discussion. Sorry.
No, NFS is not stateless; the term is relative (which I attempted to point
out below). The "stateless" wars are pointless; however, the fact that
the "relatively stateless" design of NFS has simplified implementations
should not be ignored...

"Stateless" is not simply the absence of "open()" notification, though--shared
knowledge on the part of clients and servers (particularly the knowledge
needed for cache consistency) is more critical (and difficult) state
to track.

> "XID cache" is vital for making NFS come even as close as it does to real
> UNIX file system semantics, and is by itself a sufficient counter to the
> old claim that "NFS is stateless."

I like your term relatively stateless, particularly for this reason.
However, given the level of the discussion in the e-mails I responded to,
I felt comfortable pointing out that some of the fundamental design
considerations in NFS have pretty basic implications for what one
can and cannot do in an implementation. The need for an "XID cache"
addresses a "bug" in the protocol. Suggestions to eliminate syncing data
to stable storage on a server before returning NFS_OK on a write
undermine basic assumptions made by clients.

> > ...
> > The assumption that servers flush data to stable storage before returning
> > NFS_OK to the client has nothing to do with client crashes () but has
> > everything to do with the implications of server crashes. By requiring
> > the server to write its data to stable storage, the client need not
> > concern itself with the current server state. On receiving an NFS_OK from
> > the server, the client is free to reuse data buffers which held the data
> > just written. If the server crashes and returns (reboots), the client
> > will (in classic "hard" mounted situations) wait for the server to
> > return and continue where it left off. The server crash has not affected
> > the operation of clients. This is some of the behaviour usually implied
> > when people say "NFS is stateless".
> 
> No, the phrase "NFS is stateless" has been almost devoid of meaning for
> years, because it is confounded with the general notion of state, as in
> your paragraph above.

Yes, perhaps I should have moved up the lower paragraph. I understand
and accept the relativity of the term "stateless".
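
[For readers who want the quoted "hard mount" scenario made concrete,
here is a minimal Python sketch. Everything in it is hypothetical
(FlakyServer, hard_mount_write, and the in-memory "stable" dictionary
standing in for disk or static RAM); it is not taken from any real NFS
implementation. It only illustrates the client holding its buffer and
retransmitting the same request until the server answers NFS_OK.]

```python
class FlakyServer:
    """Hypothetical server that gives no reply for its first few requests,
    as if it had crashed and were rebooting."""

    def __init__(self, down_for):
        self.down_for = down_for
        self.stable = {}  # stands in for stable storage (disk / static RAM)

    def write(self, xid, offset, data):
        if self.down_for > 0:
            self.down_for -= 1
            raise TimeoutError("no reply from server")
        # Data reaches "stable storage" before the reply is sent.
        self.stable[offset] = data
        return "NFS_OK"


def hard_mount_write(server, xid, offset, buf):
    """Retransmit the same request (same XID, same buffer) until NFS_OK.

    Like a hard mount, this never gives up. A real client would back
    off between retransmissions; that is omitted here.
    """
    tries = 0
    while True:
        tries += 1
        try:
            status = server.write(xid, offset, buf)
        except TimeoutError:
            continue  # server still down: keep the buffer, try again
        if status == "NFS_OK":
            return tries  # only now may the caller reuse buf
```

The point of the sketch is that the client needs no knowledge of the
server's state: it simply repeats the request, and an NFS_OK reply is
the only signal it needs before releasing the buffer.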

> > ["stateless" is a relative term--we're obviously talking about state
> > on a client in the form of buffers held for 30 seconds.  This is normal
> > "UNIX" buffering behaviour. There are other "stateless" design implications,
> > the other well-known one being the simple cache coherency strategy used
> > by NFS which results in checking the attributes of a file to validate
> > whether locally cached data in the client is still valid--that is,
> > in agreement with server data.
> 
> What has this to do with "statelessness"?  Please say what this "stateless"
> has to do with the differences between the NFS cache coherence mechanism
> and the coherency mechanisms in the distributed cache systems for files,
> RAM, host names, and toaster temperatures.

Ummmm... This was my small way of saying that a bald statement
of "NFS is stateless" is untrue, putting me in violent agreement with
your "relatively stateless" statement. 
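
[The attribute check mentioned in the quoted paragraph can be sketched
roughly as follows. This is a simplified illustration with hypothetical
names (Attrs, CachedFile, read_cached); real clients typically also
cache the attributes themselves for a short time, which is left out
here. The client keeps the attributes it saw when it filled its cache
and treats any difference as "cached data no longer agrees with the
server".]

```python
from dataclasses import dataclass


@dataclass
class Attrs:
    """The subset of file attributes this sketch compares."""
    mtime: float
    size: int


class CachedFile:
    """Client-side cache entry: data plus the attributes seen when cached."""

    def __init__(self, data, attrs):
        self.data = data
        self.attrs = attrs


def read_cached(entry, getattr_rpc, read_rpc):
    """Return file data, refetching it if the server's attributes changed.

    getattr_rpc and read_rpc are hypothetical callables standing in for
    the GETATTR and READ requests to the server.
    """
    fresh = getattr_rpc()
    if fresh != entry.attrs:
        # Server copy changed since we cached it: discard and refetch.
        entry.data = read_rpc()
        entry.attrs = fresh
    return entry.data
```

Note that the server keeps no record of who has what cached; all the
"state" lives in the client's remembered attributes.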

There are a lot of interesting "state" thingies agreed to by the
clients and servers. File handles are agreed to "persist" over a
crash. (Is this in the specification?) The state describing a file
handle in UNIX is information on disk.

> >          ...                      Servers are also not without "state"--
> > servers typically employ a read-ahead strategy to improve performance--
> > however the key here is that such server state is not critical to proper
> > operation of NFS.]
> 
> Wrong.  Without a proper XID cache, an NFS filesystem is an unacceptably
> poor imitation of a UNIX filesystem.  Remember the problems at the
> Connectathon before last.

Yes, the XID Reply Cache is "highly recommended state":-) for a working
NFS server. Would you allow me to separate out the XID cache solution
from things like read-ahead, which are not required for "proper
operation"?

[Also, if you have implemented an NFS server, but don't know what
the XID cache is, look at the Usenix Winter 88 paper on "Improving
Correctness and Performance in an NFS File Server"--I believe that's
the title.]
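
[As a rough illustration of the idea, and not the design from that
paper, here is a minimal Python sketch of a reply cache. The names
ReplyCache and make_remove are hypothetical, and keying on
(client, xid) with a fixed capacity is an assumption made for brevity.
The server remembers the reply it sent for a recent request, so that a
retransmitted non-idempotent request (REMOVE here) gets the original
answer instead of being executed a second time.]

```python
from collections import OrderedDict


class ReplyCache:
    """Remembers recent replies, keyed by (client, xid)."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.entries = OrderedDict()

    def handle(self, client, xid, operation):
        key = (client, xid)
        if key in self.entries:
            # Retransmission: replay the saved reply, do not re-execute.
            return self.entries[key]
        reply = operation()
        self.entries[key] = reply
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        return reply


def make_remove(directory, name):
    """Build a non-idempotent REMOVE against a set of file names."""
    def remove():
        if name in directory:
            directory.remove(name)
            return "NFS_OK"
        return "NFSERR_NOENT"
    return remove
```

Without the cache, a client whose first REMOVE reply was lost would
retransmit, the server would execute the request again, and the client
would see NFSERR_NOENT for a remove that actually succeeded.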

> Please understand that I like NFS very much and stuff many megabytes thru
> NFS filesystems everyday.  I think the trade-offs of Bob Lyon &co. were
> great and continue to be close to optimal.  Honesty conflicts with claims
> that NFS==UFS.  There are many common UNIX behaviors where NFS is a poor
> imitation of a Real Filesystem(tm).

Yes. Lyon & Co. made trade-offs. And yes, NFS is not a UNIX file system.
It comes close where it counts for most situations. And when it doesn't
satisfy your requirements (strict read/write consistency without
locking, for example, or append mode writes, ???) then you have a problem.

[Vernon: Do you feel like posting an enumerated prioritized list of
missing features in NFS--with some measure of how important that feature
is? That should start an interesting discussion--I'd like to see it.]

> > The semantics of an NFS write are to preserve data in event of a server
> > crash (by requiring it to be on stable storage--static RAM or disk).
> > ...
> 
> > Suggestions on just allowing servers to return NFS_OK without flushing
> > to stable storage [as have been made in preceding e-mails]
> > are in some sense dangerous. Because all existing
> > clients are implemented under the assumption that NFS servers only
> > reply okay if the data is "safe". {Assuming you didn't just lose
> > the server disk you wrote to during the server crash.}
> 
> Exactly.  Life is "dangerous" and filled with disk crashes.

Yes, life is dangerous, but the fact that your "server crash" might mean
you lost your disk doesn't lead to the conclusion that you should
implement your NFS server such that it doesn't synchronize NFS writes
to disk!!! I have this paranoia that some people are making that leap
in some of the discussions I've seen and heard. I would postulate that
most server crashes don't result in lost disks, and that clients can
continue once the machine comes back on-line (if the server
"dogmatically" flushed the data to "stable" storage).

> > ...
> > The semantics of "close" returning any asynchronous write errors
> > (in effect returning following the flush of data to stable storage
> > on the server) provide further guarantees to the application.
> > ...
> > The attempt is to eliminate insidious silent errors.
>  
> I understand guarantees as absolute, except where explicitly limited.  The
> Federal Government and the State of Calif agree with me.  If something is
> guaranteed to not lose data, then it better not.  The NFS server dogma does
> not provide a valid guarantee of preserving data, or of no silent errors.
> It only improves the likelihoods.  This is because there is no such thing
> as absolutely stable storage.  (As I write this, I'm restoring a crashed
> disk.)

I agree. I have eliminated "guaranteed" from my vocabulary. Guaranteed.

> In most UNIX systems, the server cache in DRAM is lost during a crash, disk
> sectors are usually not lost, and there is no third medium.  There are
> other possibilities.  In the 1960's I worked with "mainframes" (Kronos on
> 6000's) where you could push the reset button ("level 3 restart"), and not
> only have all active jobs resume, but where the contents of the RAM disk
> caches would be recovered.  Amdahl, Unisys, CDC, and IBM probably still
> have such features.  There are also systems where there are more than 2
> layers of storage.  Where would the NFS server dogma require that a system
> with "permanent" optical storage (whether modern WORM or ancient
> microfiche), behind slow disks, behind fast drums, behind bulk RAM, behind
> fast DRAM, behind SRAM cache preserve client data?  On the most stable,
> even if it takes minutes to write?

Have you or anyone ever seen NFS servers with "intelligent" caching
disk controllers create a "loss of data" problem?

At this point I'm wondering if you are advocating throwing away
the "requirement" for a server to flush to "stable" storage? Are you?

> > Stable storage caching (static RAM techniques) on the server accelerates
> > client applications OVERALL because latency on NFS write requests
> > is reduced (as read-ahead techniques reduce latency by eliminating
> > synchronous disk access, so writing to Static RAM reduces latency
> > by eliminating synchronous disk write activity). The key point
> > here is that no one particular application's write performance
> > is improved, but an OVERALL NFS client's performance is improved
> > (thereby improving all applications).
> > ...
> 
> This is a strange statement.  We found years ago that violating the NFS
> cache dogma improved the numbers on many NFS benchmarks, from the Sun test
> suite to many other benchmarks by 50%.  (Yes, fellow Connectathon
> attendees, that is one of our secrets, now disclosed in an /etc/exports
> option.)

Maybe I'm being misunderstood. Try again. Using static RAM as "fast"
stable storage as a buffer to disk enables an NFS server to speed
up writes while providing the same level of "assurance" to the NFS
client on the subject of data persistence over a server crash. Violating
the "NFS cache dogma" would increase server write performance
in the same way, but with an increase in the probability of lost data
if a server crashed. The critical question is "how much is the increase
in failure possibilities--lost data"? Which then leaves you with the
decision of: "How lucky do I feel, given these probabilities?"
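
[To make the distinction concrete, here is a minimal Python sketch of
the two server behaviours. The function nfs_write and its "stable" flag
are hypothetical and only illustrate the ordering of events; a real
server works inside the kernel, not through os.fsync. With stable=True
the data is forced to the storage device before NFS_OK is returned;
with stable=False the reply goes out while the data may still sit in
volatile buffers.]

```python
import os


def nfs_write(path, offset, data, stable=True):
    """Write data at offset and return "NFS_OK".

    stable=True  : flush to the storage device before replying
                   (the "strict" behaviour discussed above).
    stable=False : reply as soon as the data is in the buffer cache
                   (faster, but lost if the server crashes first).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        if stable:
            os.fsync(fd)  # block until the OS reports the data on the device
    finally:
        os.close(fd)
    return "NFS_OK"
```

Both calls return the same NFS_OK, which is exactly the problem: the
client cannot tell from the reply which promise the server is making.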

Do you mean that SGI was doing this silently (not requiring syncs
to disk) and have now made it an external option? What's the
default? Can you send me the man page describing this option?
You firmly believe this "flush to stable storage" requirement is in 
the realm of dogma?

> It would be less dogmatic to say that when a server returns NFS_OK, it is
> saying that the MTBF of the place containing the client's data is greater
> than XXX, where the MTBF includes all possibilities of failure from power
> to earthquake to kernel bug.  
> 
> The NFS protocol should dictate the external characteristics of the server
> file system, not its internal implementation.  Whether the server flushes
> to disk is an internal implementation issue.

Actually, since we're being honest, we both know the NFS protocol
specification is none too clear on these issues. For instance, the XID 
Reply cache is not specified, whereas you imply that it is a necessary
component of an NFS server implementation (and I would not disagree).
The protocol specification dictates pretty straightforward
external characteristics.

Perhaps I should add for the interested reader that most of what we're
discussing (XID cache, consistency semantics, and other "implementation"
details) is not called out in the protocol specification, but consists
merely of aspects of particular implementations. The real world intrudes
here. A lot of practical knowledge is exchanged at Connectathon every
year on how to improve implementations. [I'm of the school that
no specification eliminates the need for interoperability testing.
I think Connectathon is one of the very good things done in the
NFS community.]

> Rational customers buy solutions to problems.  They don't care about
> violations of dogma.  They only want an appropriate engineering solution to
> preserving their data.  They don't care whether server buffers are flushed
> to disk.  They care only that data are sufficiently rarely lost.

Agreed. How does your company ensure (Ah! he artfully avoids the
contentious word "guarantee") "that data are sufficiently rarely lost."?
What was the "appropriate engineering solution to preserving their
data" added when the requirement for synchronous writes was dropped?

> I was not present when the NFS cache dogma was graven in stone, but I
> wonder if it was not mostly a statement about the lack of reliability of
> NFS servers of the time (i.e. 68010 UNIX systems in 1984).

Is your basis simply then that today servers are more reliable, and that
in practice this is not a problem? Is server reliability the critical
factor or are external factors like power outages, errant flipping
of power switches, etc. significant? I would assume that disk MTBF's
were much greater than server MTBF's, and synchronous writes exploit
this.

> The NFS cache dogma does solve problems, but those problems are of people
> selling things, not of people building or buying things.

Wow. Wow again. I'm thinking about what everyone is selling (including
you). Forget absolute failure probabilities... Do you have a relative
probability of lost data between flushing to disk and not flushing
to disk on a server before responding to client? Or any failure data?
Because this has obviously been (and seems increasingly to be) a point of
contention between "strict" (you would say "dogmatic") NFS implementations
and "loose" (would you say "enlightened":-) implementations. Feedback
on how small (or non-existent) a problem this is would be of great
interest to me. And to others, as this seems to be an increasingly
polarizing issue.

> Vernon Schryver
> Silicon Graphics
> vjs@sgi.com

Brian Pawlowski
Sun Microsystems
beepy@eng.sun.com