Path: utzoo!attcan!uunet!auspex!guy
From: guy@auspex.auspex.com (Guy Harris)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS writes and fsync().
Message-ID: <4190@auspex.auspex.com>
Date: 15 Oct 90 18:03:56 GMT
References: <1990Oct9.152612@objy.objy.com> <thurlow.655748135@convex.convex.com> <1990Oct14.082712.10811@objy.com>
Organization: Auspex Systems, Santa Clara
Lines: 118

>Guy Harris mentioned in an email thread of this conversation that a
>asynchronous extension of NFS has been considered.

Err, umm, no, I didn't.  What I mentioned was the WRITECACHE operation
from an NFS3 protocol spec.  The WRITECACHE operation, as proposed
therein, was (in the RPC language used in that spec, which was the
2/10/89 version):

	NFSPROC_WRITECACHE(file,flush,beginoffset,totalcount,offset,data)
	    returns(reply)

	fhandle file;
	boolean flush;
	fileoffset beginoffset;
	unsigned totalcount;
	fileoffset offset;
	nfsdata data;

	struct wcokres {
		fattr attributes;
		unsigned count;
		union flushinfo switch (bool flushed) {

		case TRUE:
			void;

		case FALSE:
			unsigned flushcnt;
			errinfo flusherror;
		} flushed;
	};

	union writecacheres switch (stat status) {

	case NFS_OK:
		wcokres	ok;

	case NFS_WARN:
		wcokres aok;
		errinfo warning;

	case NFS_ERROR:
		errinfo errinfo;
	} reply;

  This call writes "data" (which is just a bunch of data) into the
  server's data cache for a regular (ftype NFREG) file.  "Beginoffset" and
  "totalcount" describe the offset and size, in bytes, of the entire piece
  of data to be written.  These values will be the same across a set of
  WRITECACHE operations.  Each of the WRITECACHE operations in a set will
  have different values for "offset" and "data".  "Offset" is the byte
  offset into "file" where "data" should be written.

  The boolean "flush", if TRUE, causes the server to flush a whole data
  set; that is, to commit to disk the data from the several WRITECACHE
  operations whose "offset" values fall between "beginoffset" and
  "beginoffset+totalcount".  If the server's attempt to flush "totalcount"
  bytes of data starting at "beginoffset" bytes into the file is
  successful, the server will return "reply.flushed" TRUE.

  If "reply.status" is NFS_OK, "reply.ok.attributes" is the attributes of
  the file following the write.  If "reply.flushed" is FALSE,
  "reply.ok.flushed.flushcnt" is the number of consecutive bytes that
  actually got written starting at "beginoffset", which may be less than
  "totalcount", and "reply.ok.flushed.flusherror" contains error
  information about the write operation that failed.  If "reply.status" is
  NFS_ERROR, then the call failed and "reply.errinfo" contains error
  information.  If "reply.status" is NFS_WARN, then all the data returned
  in the NFS_OK case is returned, along with a warning "errinfo" structure.
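
[An aside from me, not part of the spec text: here is a minimal Python
sketch of how a client might read the reply to a WRITECACHE that had
"flush" set.  The class and function names are hypothetical, the error
information is reduced to a string, and the file attributes are left
out.  I'm assuming the function is only applied to the reply to the
flushing call; the spec text doesn't say what "flushed" holds otherwise.]

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical Python mirror of the reply union above (attributes omitted).
@dataclass
class WcOkRes:
    count: int                        # bytes cached starting at "offset"
    flushed: bool                     # discriminant of "flushinfo"
    flushcnt: int = 0                 # meaningful only when flushed is False
    flusherror: Optional[str] = None  # meaningful only when flushed is False

@dataclass
class WriteCacheRes:
    status: str                       # "NFS_OK", "NFS_WARN" or "NFS_ERROR"
    ok: Optional[WcOkRes] = None      # present for NFS_OK and NFS_WARN
    errinfo: Optional[str] = None     # warning or error information

def bytes_safe_on_disk(reply, totalcount):
    """Bytes known to be on disk, counted from "beginoffset"."""
    if reply.status == "NFS_ERROR":
        return 0                      # the call failed outright
    if reply.ok.flushed:
        return totalcount             # the whole set was committed
    return reply.ok.flushcnt          # only a leading run was committed
```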

  The server has the option of accepting only a portion of data.  In this
  case "reply.ok.count" is the number of bytes of data that were cached
  starting at "offset" in the file.  The size of "data" must be less than
  or equal to the value of the "wsize" field in the GETFSINFO reply
  structure [this is the maximum number of bytes the server will accept in
  a "write" operation] for the file system that contains "file".  In
  addition, "totalcount" must be less than or equal to the "wcsize" field
  in the GETFSINFO reply [this is the maximum number of bytes that the
  server will let you write out in a set of WRITECACHE operations].

  IMPLEMENTATION

  The WRITECACHE operation is provided for performance only.  Servers
  are not required to support it.

  Clients can use the WRITECACHE operation to group consecutive WRITE
  operations without incurring the overhead of flushing each chunk of
  data through to disk on the server.  The client takes responsibility
  for recovering from server errors by holding on to data that has been
  written with WRITECACHE until a successful flush has occurred.  This
  way, to recover from an error the client can either retry the set of
  WRITECACHE operations or use WRITE operations to ensure that the data
  is safely on the server's disk.

  The server may pre-flush cached data to disk to free up cache space.
  If this happens, the server can either return an error in response to
  the flush request and force the client to resend everything, or keep
  track of data that has already been flushed when the flush request
  comes along.  This way, if the server can account for all data as
  either in the cache or already flushed, the flush request can return
  success.
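
To make the server's second option concrete, here is a rough Python
sketch of the bookkeeping it implies.  Nothing in it comes from the
spec: "WriteSet", "covers" and the method names are made up, and the
actual disk write is only a comment.  The point is just that a flush
can succeed if every byte of the set is either still cached or already
pre-flushed.

```python
def covers(ranges, begin, total):
    """True if the (offset, length) ranges cover [begin, begin+total)."""
    pos = begin
    for off, length in sorted(ranges):
        if off > pos:
            return False              # a gap: some bytes are unaccounted for
        pos = max(pos, off + length)
    return pos >= begin + total

class WriteSet:
    """Hypothetical per-set state kept by a server."""
    def __init__(self, beginoffset, totalcount):
        self.begin = beginoffset
        self.total = totalcount
        self.cached = {}              # offset -> data still in the cache
        self.flushed = []             # (offset, length) already on disk

    def cache(self, offset, data):
        self.cached[offset] = data

    def preflush(self, offset):
        data = self.cached.pop(offset)
        # A real server would write "data" to disk here.
        self.flushed.append((offset, len(data)))

    def flush(self):
        ranges = self.flushed + [(o, len(d)) for o, d in self.cached.items()]
        if not covers(ranges, self.begin, self.total):
            return False              # client must resend or fall back
        for offset in list(self.cached):
            self.preflush(offset)
        return True
```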

So this is *NOT* the same as making writes asynchronous.  It's more like
letting a single *synchronous* write be broken up into several pieces.
Those pieces are WRITECACHE operations with the same "file",
"beginoffset", and "totalcount" values, those values being the values
that correspond to the single write.  The individual pieces are
identified by the "offset" values in the WRITECACHE operations, and the
data in the pieces are the "data" values.
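
As an illustration only, here is a small Python sketch of how a client
might carve one write into such a set.  The function name and the
dictionary layout are my own invention, the "file" argument is left
out, and "wsize" and "wcsize" stand for the GETFSINFO limits mentioned
in the spec text.  It sets "flush" on the last piece only, which is how
I read the proposal.

```python
def split_write(beginoffset, data, wsize, wcsize):
    """Split one write into the arguments of a set of WRITECACHE calls."""
    totalcount = len(data)
    if totalcount > wcsize:
        raise ValueError("totalcount exceeds wcsize")
    calls = []
    for start in range(0, totalcount, wsize):
        piece = data[start:start + wsize]      # never longer than wsize
        calls.append({
            "flush": start + wsize >= totalcount,  # TRUE on the last piece
            "beginoffset": beginoffset,        # same in every call
            "totalcount": totalcount,          # same in every call
            "offset": beginoffset + start,     # differs per piece
            "data": piece,
        })
    return calls
```

With a 20000-byte write at offset 8192 and a "wsize" of 8192, this
yields three calls, at offsets 8192, 16384 and 24576.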

The final WRITECACHE operation has a "flush" value of TRUE, the others
having a "flush" value of FALSE.  Just as is the case with a single
"write" operation, the client holds onto the data until it's *all*
flushed to disk; it must not free up stuff just because it's been sent
to the server with a WRITECACHE operation - it has to wait until the
final WRITECACHE operation succeeds.  All but the final WRITECACHE
operation resemble asynchronous writes; the final WRITECACHE is still
synchronous.
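
A sketch of that client-side rule, in Python, under some loud
assumptions: "send_writecache" and "send_write" are hypothetical
callables standing in for the RPCs, and "send_writecache" is assumed to
return True only when the reply reports a successful flush.  The
fallback shown is the plain-WRITE recovery path; retrying the whole set
would be the other choice.

```python
def write_set(pieces, send_writecache, send_write):
    """pieces: list of (offset, data).  Returns True if the flush succeeded."""
    held = list(pieces)               # nothing may be freed yet
    flushed = False
    for i, (offset, data) in enumerate(held):
        is_last = i == len(held) - 1
        flushed = send_writecache(offset, data, is_last)
    if not flushed:
        # Recover with ordinary synchronous WRITEs of the held data.
        for offset, data in held:
            send_write(offset, data)
    held.clear()                      # only now is it safe to free the data
    return flushed
```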