Path: utzoo!attcan!uunet!zephyr.ens.tek.com!uw-beaver!cornell!rochester!rutgers!usc!elroy.jpl.nasa.gov!decwrl!sgi!llustig!objy!prefect!peter
From: peter@prefect.uucp (Peter Moore)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS writes and fsync().
Message-ID: <1990Oct14.082712.10811@objy.com>
Date: 14 Oct 90 08:27:12 GMT
References: <1990Oct9.152612@objy.objy.com> <thurlow.655748135@convex.convex.com>
Sender: news@objy.com
Organization: Objectivity Inc.
Lines: 123


> 
> I think most people would agree that the default behaviour should be to
> make writes reliable, since that provides the semantics of a local
> filesystem.  
>

But those aren't the semantics of a local write.  See below.

> 
> Stop right there.  Your 'disk' has just lost data, period.  Do you
> expect your local disk to ever do that? 
> 

Yes I do expect my local disk to do that.  As you sort of mention,
local writes under almost all Unix systems are asynchronous.  The
write returns immediately, but the data stays in the buffer pool until
either it is pushed out to make room for more active pages or until
sync() is called.  Typically sync() is called every 30 seconds by the
update daemon.  So you have no guarantee that your last 30 seconds of
local I/O ever make it to disk unless you explicitly do a sync.  And if
something does go wrong (an unrecoverable bad block, drive off line, or
a full crash during the 30 second period), there is no way to signal
back to you that it failed.  Heck, your process could have exited
before the write failed.

>   The effects could be very
> devastating, depending on what exactly cared about the data.  Think
> of the havoc you could wreak on a database server.
>

But, as I pointed out, this effect can happen on local writes too.
That is why any `database-like' application must explicitly call
fsync() if it wishes to guarantee that pages have made it to disk. No
recoverable system can depend on the write() alone when writing to
local disk.  So synchronous NFS isn't helpful to the database people, at
least for that reason.  They are already doing the right thing with
explicit syncs just to make it work locally.
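
To make the point concrete, here is a minimal C sketch of the discipline
described above, assuming a POSIX system.  The helper name durable_write
is made up for illustration; only write() and fsync() are real calls.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write all of buf to fd, then force it to stable storage.
 * Returns 0 on success, -1 on failure with errno set.
 * (durable_write is a hypothetical helper, not a standard call.) */
int durable_write(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;
        }
        p += n;                 /* short write: advance and keep going */
        len -= (size_t)n;
    }

    /* A successful write() only means the kernel accepted the data into
     * its buffer pool.  fsync() is the last point at which a deferred
     * I/O error can still be reported to this process. */
    if (fsync(fd) < 0)
        return -1;
    return 0;
}
```

A caller that treats a 0 return as "committed" gets the same guarantee
on a local disk or, presumably, over any NFS variant that honors fsync().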

This is why synchronous NFS writes seem unmotivated to me.
It is MORE synchronous than local Unix I/O (assuming that network
latency is a lot less than 30 seconds).  Why pay such a cost to make
it MORE synchronous than we already are willing to live with on Unix?

The strongest justification I have been able to synthesize from
various sources is basically the original one in my article:

1) Process P on machine L writes to file F on machine R
2) R crashes before syncing the changes to disk and either:
     a) recovers before P does any more writes to F, or
     b) crashes after P has finished with F.
3) P continues on blithely unaware that the writes to F failed and
    produces some data.

This scenario is seemingly unique to NFS because most
failures of the local machine to write also involve the local machine
crashing.  So P couldn't continue after the write really failed.

    ( Actually this is still possible with local writes: P could either
    exit before the crash or the write error could be a softer error,
    perhaps bad block, that didn't force the system to crash.  I think it
    is even possible to have later I/Os make it to disk, but earlier I/Os not.
    But I do admit this scenario is more likely in the NFS case. )
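
The three-step scenario above can be modelled with a toy in C.  This is
only a sketch under loud assumptions: R is reduced to a volatile buffer
plus a disk, and every name below is illustrative, not real NFS code.

```c
#include <string.h>

/* Toy model of remote machine R: a volatile buffer pool in front of a
 * disk.  Purely illustrative; nothing here is real NFS code. */
struct server {
    char cache[64];   /* acknowledged but unsynced data, lost on a crash */
    char disk[64];    /* data that has reached stable storage            */
};

/* Asynchronous write: R acknowledges as soon as the data is buffered. */
int server_write(struct server *r, const char *data)
{
    strncpy(r->cache, data, sizeof r->cache - 1);
    r->cache[sizeof r->cache - 1] = '\0';
    return 0;                       /* P sees success */
}

/* What the update daemon (or an fsync) would do: push cache to disk. */
void server_sync(struct server *r)
{
    if (r->cache[0] != '\0') {
        strcpy(r->disk, r->cache);
        r->cache[0] = '\0';
    }
}

/* Crash and reboot: the buffer pool is gone, the disk survives. */
void server_crash_and_reboot(struct server *r)
{
    r->cache[0] = '\0';
}
```

Calling server_write(), then server_crash_and_reboot(), leaves disk
empty even though the write returned 0; a later server_write() followed
by server_sync() succeeds normally, so P has no signal that anything
was lost in between.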

I have some hand-wavey counter-arguments to the above scenario.

   1) Only relatively naive programs will run into this.  Reliable
      programs will already be doing fsync at the proper times.  So as
      long as an asynchronous form of NFS implements fsync, they will
      be all right.

   2) Only long-lived programs are vulnerable (at least more
      vulnerable than local writes).  If a process takes much less
      than 30 seconds to run, then it is very unlikely that the process
      will be actually killed by a machine crash that wipes out its
      buffered writes.  So if the process worked with local writes, it
      must either be calling fsync, or already be willing to live
      with not knowing if its I/O made it to disk.
      
   3) This isn't all that likely.  For case 2a) you need R to crash
      and recover before P does any more I/O to F.  Considering that big
      servers I have worked with can take over half an hour to reboot,
      that is a very wide window to miss. And for case 2b) R has to
      crash within 30 seconds of the very last I/O to F; no sooner, no later.

   4) Kludges can be added.  If NFS handles were invalidated across
      reboots (perhaps by including a byte computed from the boot time in
      the handle), then at least 2a) would be impossible without an
      explicit reopen of F.  More complicated support from the local
      OS could probably even make reopens of F by P fail (though
      the vague implementations I can think of are unacceptably kludgey).
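
The boot-time kludge in 4) might look roughly like the following C
sketch.  The handle layout, the field names, and the XOR fold are all
assumptions made for illustration; real NFS handles are opaque and
server-defined.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical file handle layout; every field here is an assumption. */
struct fhandle {
    uint32_t fsid;        /* which exported filesystem             */
    uint32_t inode;       /* which file on it                      */
    uint8_t  boot_epoch;  /* byte derived from the server boot time */
};

/* Fold a boot time into one byte by XORing its bytes together.
 * Two different boots can still collide, since only 256 values exist. */
uint8_t boot_epoch_byte(time_t boot_time)
{
    uint64_t t = (uint64_t)boot_time;
    uint8_t b = 0;

    for (int i = 0; i < 8; i++)
        b ^= (uint8_t)(t >> (8 * i));
    return b;
}

/* Issued by the server when a client first looks up the file. */
struct fhandle make_handle(uint32_t fsid, uint32_t inode, time_t boot_time)
{
    struct fhandle h = { fsid, inode, boot_epoch_byte(boot_time) };
    return h;
}

/* Server-side check on each request: 1 if the handle was issued during
 * the current boot, 0 if it predates a reboot and should be refused. */
int handle_is_current(const struct fhandle *h, time_t current_boot_time)
{
    return h->boot_epoch == boot_epoch_byte(current_boot_time);
}
```

A server that refused stale handles this way (presumably with an error
along the lines of NFSERR_STALE) would force P to notice the reboot on
its next I/O to F, which closes case 2a) but does nothing for 2b).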

Now none of these arguments are overwhelming, but they do add up.  I
am not trying to argue that NO one needs or wants synchronous NFS.  I
am arguing that not everybody does (and I believe, but can't defend
better than the above, that MOST people don't need it).

> I'll add that we do provide an export option to allow you to tell the
> server to acknowledge the write request immediately upon receipt, and
> spool the request to its local I/O subsystem.  It can help performance
> a good bit if you don't mind the risks.  It's great for filesystems all
> clients mount with -soft; their processes will be gone after a server
> reboot, anyway.

This is exactly the sort of thing I want.  Now I just need it on all
my machines as an option.  Guy Harris mentioned in an email thread of
this conversation that an asynchronous extension of NFS has been
considered.  This would seem the best path, allowing some protocol to
negotiate whether the NFS connection will be synced (`nfsmount -o
async` perhaps?).  This could be a big win for a lot of installations.
( Maybe even linking a RISC application over NFS could finally take a
finite amount of time.)

     Peter Moore
     peter@objy.com

        
P.S.
     While I don't know if they even want to be associated with this
     argument, I would like to thank Craig Everhart, Carl Smith, and
     Guy Harris for having the patience to discuss much of the above
     with me in email conversations.