Path: utzoo!attcan!uunet!auspex!guy
From: guy@auspex.auspex.com (Guy Harris)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS writes and fsync().
Message-ID: <4190@auspex.auspex.com>
Date: 15 Oct 90 18:03:56 GMT
References: <1990Oct9.152612@objy.objy.com> <1990Oct14.082712.10811@objy.com>
Organization: Auspex Systems, Santa Clara
Lines: 118

>Guy Harris mentioned in an email thread of this conversation that a
>asynchronous extension of NFS has been considered.

Err, umm, no, I didn't. What I mentioned was the WRITECACHE operation from an NFS3 protocol spec. The WRITECACHE operation, as proposed therein, was (in the RPC language used in that spec, which was the 2/10/89 version):

    NFSPROC_WRITECACHE(file,flush,beginoffset,totalcount,offset,data)
            returns(reply)
        fhandle file;
        boolean flush;
        fileoffset beginoffset;
        unsigned totalcount;
        fileoffset offset;
        nfsdata data;
        struct wcokres {
            fattr attributes;
            unsigned count;
            union flushinfo switch (bool flushed) {
            case TRUE:
                void;
            case FALSE:
                unsigned flushcnt;
                errinfo flusherror;
            } flushed;
        };
        union writecacheres switch (stat status) {
        case NFS_OK:
            wcokres ok;
        case NFS_WARN:
            wcokres aok;
            errinfo warning;
        case NFS_ERROR:
            errinfo errinfo;
        } reply;

This call writes "data" (which is just a bunch of data) into the server's data cache for a regular (ftype NFREG) file. "Beginoffset" and "totalcount" describe the offset and size, in bytes, of the entire piece of data to be written. These values will be the same across a set of WRITECACHE operations. Each of the WRITECACHE operations in a set will have different values for "offset" and "data". "Offset" is the byte offset into "file" where "data" should be written. The boolean "flush", if TRUE, causes the server to flush a whole data set. That is, commit to disk the data from several WRITECACHE operations, whose "offset" values fall between "beginoffset" and "beginoffset+totalcount".
If the server's attempt to flush "totalcount" bytes of data starting at "beginoffset" bytes into the file is successful, the server will return "reply.flushed" TRUE. If "reply.status" is NFS_OK, "reply.ok.attributes" is the attributes of the file following the write. If "reply.flushed" is FALSE, "reply.ok.flushed.flushcnt" is the number of consecutive bytes that actually got written starting at "beginoffset", which may be less than "totalcount", and "reply.ok.flushed.flusherror" contains error information about the write operation that failed. If "reply.status" is NFS_ERROR, then the call failed and "reply.errinfo" contains error information. If "reply.status" is NFS_WARN, then all the data returned in the NFS_OK case is returned, along with a warning "errinfo" structure.

The server has the option of accepting only a portion of data. In this case "reply.ok.count" is the number of bytes of data that were cached starting at "offset" in the file.

The size of "data" must be less than or equal to the value of the "wsize" field in the GETFSINFO reply structure [this is the maximum number of bytes the server will accept in a "write" operation] for the file system that contains "file". In addition, "totalcount" must be less than or equal to the "wcsize" field in the GETFSINFO reply [this is the maximum number of bytes that the server will let you write out in a set of WRITECACHE operations].

IMPLEMENTATION

The WRITECACHE operation is provided for performance only. Servers are not required to support it. Clients can use the WRITECACHE operation to group consecutive WRITE operations without incurring the overhead of flushing each chunk of data through to disk on the server. The client takes responsibility for recovering from server errors by holding on to data that has been written with WRITECACHE until a successful flush has occurred.
This way, to recover from an error the client can either retry the set of WRITECACHE operations or use WRITE operations to ensure that the data is safely on the server's disk.

The server may pre-flush cached data to disk to free up cache space. If this happens, the server can either return an error in response to the flush request and force the client to resend everything, or keep track of data that has already been flushed when the flush request comes along. This way, if the server can account for all data as either in the cache or already flushed, the flush request can return success.

So this is *NOT* the same as making writes asynchronous. It's more like letting a single *synchronous* write be broken up into several pieces. Those pieces are WRITECACHE operations with the same "file", "beginoffset", and "totalcount" values, those values being the values that correspond to the single write. The individual pieces are identified by the "offset" values in the WRITECACHE operations, and the data in the pieces are the "data" values. The final WRITECACHE operation has a "flush" value of TRUE, the others having a "flush" value of FALSE.

Just as is the case with a single "write" operation, the client holds onto the data until it's *all* flushed to disk; it must not free up stuff just because it's been sent to the server with a WRITECACHE operation - it has to wait until the final WRITECACHE operation succeeds. All but the final WRITECACHE operation resemble asynchronous writes; the final WRITECACHE is still synchronous.
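
To make the "one synchronous write broken into pieces" idea concrete, here is a rough Python sketch. It is not from the spec and not how any real client or server does it. The request field names follow the RPC declaration quoted above; everything else (split_write, ToyServer, client_write, the dict-shaped reply, the string file handle, the "missing data" error text) is made up for illustration. The toy server does not model pre-flushing, NFS_WARN, or partial acceptance of "data".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WriteCacheRequest:
    """One WRITECACHE call; field names follow the quoted spec."""
    file: str          # hypothetical stand-in for an fhandle
    flush: bool
    beginoffset: int
    totalcount: int
    offset: int
    data: bytes


def split_write(file: str, beginoffset: int, payload: bytes,
                wsize: int, wcsize: int) -> List[WriteCacheRequest]:
    """Break one logical write into a set of WRITECACHE requests."""
    if wsize <= 0:
        raise ValueError("wsize must be positive")
    if not payload:
        raise ValueError("nothing to write")
    if len(payload) > wcsize:
        raise ValueError("totalcount exceeds the server's wcsize")
    requests = []
    for start in range(0, len(payload), wsize):
        chunk = payload[start:start + wsize]
        last = start + wsize >= len(payload)   # only the final piece flushes
        requests.append(WriteCacheRequest(
            file=file, flush=last, beginoffset=beginoffset,
            totalcount=len(payload), offset=beginoffset + start, data=chunk))
    return requests


class ToyServer:
    """In-memory stand-in for a server; purely illustrative."""

    def __init__(self) -> None:
        self.cache = {}            # offset -> bytes, cached but not on "disk"
        self.disk = bytearray()    # bytes committed to "disk"

    def writecache(self, req: WriteCacheRequest) -> dict:
        self.cache[req.offset] = req.data
        if not req.flush:
            return {"count": len(req.data)}
        # Flush: commit consecutive cached bytes starting at beginoffset.
        end = req.beginoffset + req.totalcount
        pos = req.beginoffset
        while pos < end and pos in self.cache:
            chunk = self.cache.pop(pos)
            if len(self.disk) < pos + len(chunk):
                self.disk.extend(bytes(pos + len(chunk) - len(self.disk)))
            self.disk[pos:pos + len(chunk)] = chunk
            pos += len(chunk)
        flushcnt = pos - req.beginoffset
        if flushcnt == req.totalcount:
            return {"count": len(req.data), "flushed": True}
        return {"count": len(req.data), "flushed": False,
                "flushcnt": flushcnt, "flusherror": "missing data"}


def client_write(server: ToyServer, file: str, beginoffset: int,
                 payload: bytes, wsize: int, wcsize: int) -> bool:
    """Send the whole set; the caller keeps `payload` until this is True."""
    reply = {}
    for req in split_write(file, beginoffset, payload, wsize, wcsize):
        reply = server.writecache(req)
    return reply.get("flushed", False)
```

The point to notice is that client_write only reports success on the reply to the last request, the one with "flush" TRUE. If it returns False, the caller still has the payload and can resend the set or fall back to ordinary WRITE operations.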