Newsgroups: comp.protocols.nfs
Path: utzoo!utgpu!watserv1!watmath!att!linac!Firewall!genesis!kdenning
From: kdenning@genesis.Naitc.Com (Karl Denninger)
Subject: Re: NFS performance
Message-ID: <1991Jun14.222604.13965@Firewall.Nielsen.Com>
Summary: More discussion and potential "solutions" (or a good shot at same)
Sender: news@Firewall.Nielsen.Com (Usenet News)
Nntp-Posting-Host: genesis.naitc.com
Organization: AC Nielsen Co., Bannockburn IL
References: <1991Jun13.234448.16172@Firewall.Nielsen.Com> <DROMS.91Jun14092449@regulus.bucknell.edu> <6743@eastapps.East.Sun.COM>
Date: Fri, 14 Jun 91 22:26:04 GMT

In article <6743@eastapps.East.Sun.COM> geoff@east.sun.com (Geoff Arnold @ Sun BOS - R.H. coast near the top) writes:
>Quoth droms@bucknell.edu (in <DROMS.91Jun14092449@regulus.bucknell.edu>):
>#In article <1991Jun13.234448.16172@Firewall.Nielsen.Com> kdenning@genesis.Naitc.Com (Karl Denninger) writes:
>#
>#   >
>#   >If the server ACKs the data before writing it to disk, there is a window
>#   >during which the server can crash.  The data is then lost.  
>#
>#   How does this differ from the standard "Unix" way of doing file I/O, which
>#   returns a successful reply from a write call before the data is safely on
>#   disk?  .....
>#
>#I think the difference lies in the feedback to the user.  If the local
>#UNIX box crashes, the user is aware "something is wrong" immediately.
>#If the server crashes and reboots, the data can be lost silently...
>
>It's more than simply a vague "feedback to the user": it's a
>question of what assertions can be made about the correctness
>of file system operations. Even though normal buffer cache
>operations can reorder some kinds of operation, I can code something
>like
>
>	write(file1, data1)
>	fsync(file1)
>	write(file2, "file1 was written successfully")
>
>(with appropriate error checking) and be confident that file2 will
>be written if and only if file1 was written. Karl's "standard Unix way"
>doesn't apply here: if the machine crashes, the process will crash
>with it. If an NFS server could ack the first write (but not
>commit it to stable storage), then crash and reboot, the failure
>of the write would be undetectable.

Understood.  However, the issue is data loss, not reboot-n-continue
behavior or whether the process dies along with the machine.  If you 
soft mount directories (yes, I know this is dangerous) your process will 
get an I/O failure if the server goes down -- indicating that you have 
lost >something<.

Data loss is data loss -- with or without the process continuing to exist.

I would think that the real solution here would be to have a server that
has crashed and rebooted return some form of error on the next I/O request
from any client that is mounted async (which error, I don't know offhand;
perhaps ENXIO).  At least you'd be notified that there is a potential 
data integrity problem that your software needs to investigate or report.
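
A toy model of that idea, in C.  Nothing here is real NFS: the boot
counter, the struct names and the toy_* functions are all hypothetical,
made up only to show "report the reboot once, on the next request, to
async mounts".  ENXIO is just the guess from the paragraph above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical server state: a counter bumped on every boot. */
struct toy_server {
    unsigned boot_epoch;
};

/* Hypothetical per-mount client state. */
struct toy_mount {
    int async;            /* nonzero if the mount is async */
    unsigned seen_epoch;  /* server boot counter when we last talked to it */
};

void toy_mount_init(struct toy_mount *m, const struct toy_server *s, int async)
{
    m->async = async;
    m->seen_epoch = s->boot_epoch;
}

/* Simulate a crash and reboot of the server. */
void toy_server_reboot(struct toy_server *s)
{
    s->boot_epoch++;
}

/*
 * Simulate one I/O request.  Returns 0 normally, or ENXIO on the first
 * request after a reboot if the mount is async (acknowledged writes may
 * have been lost).  A sync mount has nothing to report.
 */
int toy_io_request(struct toy_mount *m, const struct toy_server *s)
{
    if (m->seen_epoch != s->boot_epoch) {
        m->seen_epoch = s->boot_epoch;   /* report the reboot only once */
        if (m->async)
            return ENXIO;
    }
    return 0;
}
```

The error is delivered once; what the application does with it (re-verify,
rewrite, or just complain) is left to the software, as suggested above.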

>The decision as to whether data should be written "safely" or not should
>logically rest with the client, not the server. This is why the
>hack of an async server side configuration option is so dangerous.
>The correct approach, of course, is the (unimplemented) RFS_WRITECACHE
>NFS function.... >sigh< But for now, Prestoserve is the best solution.
>--Geoff Arnold, PC-NFS architect(geoff@East.Sun.COM or geoff.arnold@Sun.COM)--

AGREED.  The decision SHOULD be with the client.  I believe that many
systems would opt for the async choice, but I disagree with making it
something you don't have control over at the client level.

One other option would be to have fsync() on an NFS file return success 
only if all operations since the last fsync() or open() had succeeded.  
A crash is the exception case here: after the reboot, as far as the server
knows, the client will not have executed an open() prior to the fsync() --
thus, in that case fsync() would return failure.  If the client opens with
O_SYNC, then you do only sync I/O.  On a close() do an implied fsync(), and
again return success only if all data "makes it".

This does require keeping one bit of state around -- whether or not an
"open" or "fsync" has been executed (a noted I/O error rates a "no" to that
question).
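
Here is a rough sketch of that one bit, again as a toy model in C rather
than anything that exists in NFS.  The toy_* names are hypothetical.  Two
assumptions of mine: the bit lives in volatile memory that a reboot wipes,
and a failed fsync() still starts a fresh interval.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical per-file record; 'clean' is the one bit of state. */
struct toy_file {
    unsigned clean : 1;   /* 1 = open/fsync executed and no error since */
};

void toy_open(struct toy_file *f)
{
    f->clean = 1;
}

/* Record the outcome of a write; a failed write (ok == 0) clears the bit. */
void toy_write_result(struct toy_file *f, int ok)
{
    if (!ok)
        f->clean = 0;
}

/* Reboot: volatile state is lost, so the bit reads as "no". */
void toy_reboot(struct toy_file *f)
{
    memset(f, 0, sizeof *f);
}

/* Returns 0 if everything since the last open/fsync succeeded,
 * otherwise -1 with errno set to EIO. */
int toy_fsync(struct toy_file *f)
{
    int was_clean = f->clean;

    f->clean = 1;   /* assumption: this fsync begins a new interval */
    if (!was_clean) {
        errno = EIO;
        return -1;
    }
    return 0;
}

/* close() does an implied fsync() and reports the same result. */
int toy_close(struct toy_file *f)
{
    return toy_fsync(f);
}
```

A reboot between a write and the next fsync() makes that fsync() fail
without any extra bookkeeping, because the cleared bit is indistinguishable
from "no open() was ever seen".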

This is very close to the semantics of a local filesystem, and should be
pretty easy to do.  It also doesn't affect existing software, except that 
programs that don't do an fsync() or check close() return values are at 
risk -- but on a local disk they would be too!  It is what one would 
expect on a local disk in the event of a disk failure: if you didn't check 
close()'s return value you might mistakenly think your data all got there 
when it didn't.
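
For the record, the "careful" client looks something like the following.
This is a minimal sketch using the ordinary POSIX calls; the function name
save_checked() is mine, and the file mode and flags are just reasonable
defaults, not anything NFS requires.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes from buf to path.  Returns 0 only if every write(),
 * the fsync() and the close() all succeed; otherwise returns -1 with
 * errno describing the first failure.
 */
int save_checked(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int saved;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before any data; retry */
            saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;                     /* handle short writes */
        len -= (size_t)n;
    }

    if (fsync(fd) < 0) {            /* force the data out and check the result */
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* close() can still report a deferred write error; don't ignore it. */
    return close(fd) < 0 ? -1 : 0;
}
```

A caller would treat any -1 as "the data may not be there" and report or
retry.  Adding O_SYNC to the open() flags asks for synchronous writes
instead of relying on the fsync() at the end.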

Prestoserve is not a total safety net -- it's hardware, and CAN fail.  The
risks there are exactly the same as a crash/disk failure/whatever.  The 
only real saving grace there is that it doesn't fail often, having no 
moving parts.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer:  The opinions here are solely mine and may or may not reflect
  	     those of the company.