Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!cs.utexas.edu!sun-barr!newstop!sun!exodus!terra.Eng.Sun.COM!beepy
From: beepy@terra.Eng.Sun.COM (Brian Pawlowski)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS performance
Summary: expected failures, synchronous writes, do you feel lucky
Message-ID: <15277@exodus.Eng.Sun.COM>
Date: 15 Jun 91 14:38:15 GMT
References: <1991Jun13.164017.29944@Firewall.Nielsen.Com> <1991Jun13.234448.16172@Firewall.Nielsen.Com>
Sender: news@exodus.Eng.Sun.COM
Lines: 137

[I lost the original article I was responding to so am
replying to a followup]

In article <1991Jun13.234448.16172@Firewall.Nielsen.Com>, kdenning@genesis.Naitc.Com (Karl Denninger) writes:

> The MIPS systems I've used don't suffer from this problem.

Then they probably suffer from other problems:-) That they don't
have an export option indicating "safe" and "unsafe" raises a
question...
                                                
> I don't quite understand the fanaticism with which people preach the NFS
> stateless nature, O_SYNC and all that.  The fact is that a crash of a
> LOCAL Unix machine with the normal block buffering scheme can easily cause
> the loss of data -- in this case, the write(2) call returned "ok" but it
> really might not be "OK"!  This is true whether the problem is later found
> to be a bad disk sector, the machine panicking, or any one of a number of
> other causes.  Normal disk I/O on Unix machines is NOT reliable enough to
> say "if you get a good return from write(), the data is safely on disk".
 
Good analysis for local operation. I would argue (below) that
distributed operation is a little different, particularly with regard
to the assumptions and expected behaviour when a remote node fails, as
compared to the assumptions made when a local component fails.
              
> If you WANT reliable I/O, you open with O_SYNC and take the performance hit.
> Why wasn't this option designed into NFS?  It could have been set up so that
> for Non-Unix clients (which expect reliable I/O and don't have a "buffer
> cache" that can be disabled on a file I/O basis) default to O_SYNC mode...
> this is easily handled by making the Unix "open()" hook set the "no sync"
> flag...
>
> Or was this a short-cut that has just never been repaired?
 
The "problem" with a distributed file system, as typified by NFS, is that
modifying operations held in the server's buffer cache could be lost in
a crash/failure without the client *ever* being aware of it, if the
server acknowledges success to the client and only later
(asynchronously) applies the updates to the persistent store (disk).
NFS reduces the number of (likely) failure modes by adding to each
modifying operation (write, create, etc.) the semantic that the
operation has been applied to persistent store before the server replies.
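
The commit-before-reply ordering described above can be sketched in a
few lines of Python. This is only an illustration of the ordering, not
NFS server code: handle_write() and demo() are hypothetical names, and
the sketch assumes a local file stands in for the exported file system.

```python
import os
import tempfile


def handle_write(path, offset, data):
    """Hypothetical server-side handler: commit first, acknowledge second."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)   # may be a partial write
            view = view[written:]
        # Ask the OS to push the data to the storage device before we go on.
        # (How far down the data really gets depends on the OS and hardware.)
        os.fsync(fd)
    finally:
        os.close(fd)
    return "OK"                            # reply is built only after fsync() returned


def demo():
    """Write two blocks through the handler and read the file back."""
    with tempfile.TemporaryDirectory() as export_dir:
        path = os.path.join(export_dir, "file.dat")
        replies = [handle_write(path, 0, b"AAAA"), handle_write(path, 4, b"BBBB")]
        with open(path, "rb") as f:
            return replies, f.read()
```

The point of the sketch is simply that "OK" is never returned while the
data lives only in volatile memory, which is also where the
performance cost comes from.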

[I believe several studies on the reliability of hosts on the Internet
point to non-catastrophic failures--SW failures:-)--as a primary
cause of crashes. This semantic for NFS addresses this nicely. While
having access to some number N of hosts increases the availability
of data to a given node, it also offers that many more opportunities
for an unexpected component failure (server crash) to lose your data.
NFS reduces its "critical state" assumptions, simplifies client-server
operation, and I believe increases reliability through the requirement
for flush-to-persistent storage.]

I think the analogy to the "local I/O" situation is flawed.
A user gets immediate notification of an "OS crash" in the
local case because his application crashes too. He has *little
expectation* that all his data is safe and will probably take
some action to investigate the situation.

In the distributed case, things get fuzzier. Assuming for a moment
that a given vendor implements buffered writes on an NFS server
to increase performance (tossing the synchronous modify semantic),
you have now introduced an interesting error class: silent data
loss. The scenario is this: the server acknowledges an application's
final write while still holding several of that client's data blocks
in its buffers, queued to be written to disk. The server returns "OK",
the client application is happy and exits. The server then crashes
before it is able to flush the data.

Blissfully unaware, the client (and user) continue working on other
files on other servers, and return to the server in question some
time later, after it has rebooted--and lost data the user believes
was written to disk on the server.
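
The scenario can be played out with a toy model in Python. Everything
here is hypothetical (ToyServer, run_scenario and the "report" file are
made up for illustration); it models only the difference between
acknowledging from the buffer cache and acknowledging from disk.

```python
class ToyServer:
    """Toy model of a file server; not real NFS code."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.disk = {}    # persistent store: survives a crash
        self.cache = {}   # volatile buffer cache: lost on a crash

    def write(self, name, block_no, data):
        if self.synchronous:
            self.disk[(name, block_no)] = data    # commit, then acknowledge
        else:
            self.cache[(name, block_no)] = data   # acknowledge, commit "later"
        return "OK"

    def crash_and_reboot(self):
        self.cache.clear()                        # unflushed blocks vanish

    def read(self, name, block_no):
        key = (name, block_no)
        return self.cache.get(key, self.disk.get(key))


def run_scenario(synchronous):
    """Client writes three blocks, sees only "OK", and exits; server crashes."""
    server = ToyServer(synchronous)
    replies = [server.write("report", i, f"block {i}".encode()) for i in range(3)]
    server.crash_and_reboot()
    surviving = [server.read("report", i) for i in range(3)]
    return replies, surviving
```

In both modes the client sees nothing but "OK" replies. With
run_scenario(False) every block reads back as None after the reboot,
and nothing ever told the client; with run_scenario(True) the blocks
are all still there.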

I contend that the user expects the data to be on disk because
*he knows* that, from the client's perspective, his machine has kept
running well past the interval in which buffered data is flushed to
disk. To find some time later that the data is *lost*, with no
intervening client crash, is an insidious error, and I (further)
contend that it violates a basic transparency property provided by NFS
(making remote files seem like local files--to a great degree).

NFS does not provide "exact" local file system semantics for UNIX.
The original design paper describes the decisions made in providing
semantics and the trade-offs taken to simplify implementation and
reduce the complexity of error recovery. One could envision a
production DFS which buffers
data on a server for increased performance in volatile storage.
I believe most current (research) systems which do so take a
rather cavalier attitude towards ensuring integrity of modified
data on behalf of users. I would propose that you would want
to introduce recovery mechanisms that allow a client to resubmit
data lost due to a server crash--this introduces complex recovery
scenarios to a DFS, and was left out of NFS in the original design.
[Asynchronous writes after a fashion have been proposed for
a protocol revision of NFS... Some time in the hazy future.]

Comments on Write Performance for NFS:

From a client's perspective, NFS write performance is not as bad as
the above discussion might suggest.

A client OS *still* does read-ahead and write-behind for application
I/O when talking to an NFS server, through the use of BIODs (block
I/O daemons). The close() system call semantic was extended to include
a synchronous flush of all dirty pages when you close a file, which
ensures that any errors in flushing modified data to a server
are reported to the application. [The addition of the
flush-on-close semantic to support asynchronous error return for
NFS was a design trade-off vs. *totally* synchronous writes
from the application perspective.]
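
A rough Python model of the write-behind plus flush-on-close behaviour
follows. It is a sketch under simplifying assumptions: ToyNfsFile,
full_server and demo are hypothetical names, and the dirty blocks are
pushed only at close(), whereas a real client would also push them in
the background.

```python
import errno


class ToyNfsFile:
    """Toy client-side file: write() buffers, close() flushes and reports."""

    def __init__(self, send_to_server):
        self._send = send_to_server   # callable(block_no, data); raises OSError
        self._dirty = {}              # blocks not yet sent to the server

    def write(self, block_no, data):
        self._dirty[block_no] = data  # write-behind: return without waiting
        return len(data)

    def close(self):
        first_error = None
        for block_no in sorted(self._dirty):
            try:
                self._send(block_no, self._dirty[block_no])
            except OSError as exc:
                if first_error is None:
                    first_error = exc
        self._dirty.clear()
        if first_error is not None:
            raise first_error         # the application learns of it at close()


def full_server(block_no, data):
    """Stand-in for a server whose disk fills up after the first block."""
    if block_no >= 1:
        raise OSError(errno.ENOSPC, "No space left on device")


def demo():
    f = ToyNfsFile(full_server)
    write_results = [f.write(0, b"aaaa"), f.write(1, b"bbbb")]
    try:
        f.close()
        close_error = None
    except OSError as exc:
        close_error = exc.errno
    return write_results, close_error
```

Both write() calls appear to succeed; the failure shows up only when
close() is called, which is why an application that cares about its
data has to check the result of close().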

I believe this trade-off gets close to the behaviour expected of local
files and eliminates silent data loss. [For the expected, likely
failures--SW crashes. Of course a hard disk crash burns everyone--but I
believe this is *much less* expected.]

This is not to say that write performance for NFS is outstanding:-) 
I am a proponent of improving write performance beyond current NFS
levels.                      
                                      
One immediate attack is to install a Presto board (Sun and DEC
have this. Others?) Hell, it will accelerate your local synchronous
modifying operations (like mkdir, etc). Another attack is
to use a product like eNFS for accelerating large file writes.

All improvements in this area (as the above solutions do) should
recognize that distribution inherently introduces different (more
interesting) failure modes, and that I for one (and I believe others)
don't appreciate an implementation of a distributed file
system which provides me with the wonderful possibility
of silently losing critical data.
                                
> Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
> kdenning@nis.naitc.com

Brian Pawlowski
last time I looked