Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!bbn!news
From: news@bbn.COM (News system owner ID)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS not idempotent (was re: mountd Performance under Stress)
Message-ID: <45595@bbn.COM>
Date: 14 Sep 89 16:34:20 GMT
References: <14068@bloom-beacon.MIT.EDU> <1211@sequent.cs.qmc.ac.uk>
Reply-To: pplaceway@izar.bbn.com (Paul W. Placeway)
Organization: Bolt Beranek and Newman Inc., Cambridge MA
Lines: 63

liam@cs.qmc.ac.uk (William Roberts) writes:
< jtkohl@athena.mit.edu (John T Kohl) writes:
< > liam@cs.qmc.ac.uk (William Roberts) writes:
< >
< >   This is a difference between user-level RPC and kernel-level RPC.
< >   The kernel level *knows* that its NFS RPC requests are
< >   idempotent
< >
< >Unfortunately, some of them ARE NOT idempotent, and that has caused
< >great troubles to us at MIT.

As near as I can tell, claiming that NFS ops are idempotent is nothing
more than "marketing blurf" from Sun marketing.  The problem is that
Sun didn't think quite hard enough about what a remote file system has
to do before going off and writing one; and now we are stuck with it.

< The server implementations around keep a cache of recent
< requests so that they can recite the previous reply if this
< happens to be a retransmission.

This is what is known generally as "a hack" (a workaround, if you
prefer).

The trouble is, as W.R. points out, it doesn't really work all the
time.  It makes the situation better, but it doesn't really cure it.
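
To make the failure mode concrete, here is a toy sketch of such a
reply cache in Python.  Every name in it (ToyServer, handle, the
"OK"/"ENOENT" strings, the cache size of 2) is made up for
illustration and is not taken from any real NFS server.  It assumes
the server keys cached replies on (client, transaction id) and throws
away the oldest entry when the cache is full.

```python
from collections import OrderedDict


class ToyServer:
    """Toy file server with a bounded duplicate-request cache (hypothetical)."""

    def __init__(self, cache_size=2):
        self.files = set()
        self.cache_size = cache_size
        self.replies = OrderedDict()  # (client, xid) -> reply already sent

    def _execute(self, op, name):
        # create and remove are NOT idempotent: running them twice
        # gives a different answer the second time.
        if op == "create":
            if name in self.files:
                return "EEXIST"
            self.files.add(name)
            return "OK"
        if op == "remove":
            if name not in self.files:
                return "ENOENT"
            self.files.remove(name)
            return "OK"
        raise ValueError(op)

    def handle(self, client, xid, op, name):
        key = (client, xid)
        if key in self.replies:
            # Retransmission we still remember: recite the old reply.
            return self.replies[key]
        reply = self._execute(op, name)
        self.replies[key] = reply
        if len(self.replies) > self.cache_size:
            self.replies.popitem(last=False)  # forget the oldest entry
        return reply


server = ToyServer(cache_size=2)
server.handle("a", 1, "create", "f")
first = server.handle("a", 2, "remove", "f")   # really removes "f"
retry = server.handle("a", 2, "remove", "f")   # answered from the cache

# Other traffic pushes client a's entries out of the small cache.
server.handle("b", 1, "create", "g")
server.handle("b", 2, "create", "h")

# A late retransmission of the same remove is now executed again.
late_retry = server.handle("a", 2, "remove", "f")
print(first, retry, late_retry)
```

The prompt retry gets the original answer, but the late one is run a
second time and fails, even though the remove succeeded.  A bigger
cache only makes that window smaller; it does not close it.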

The really funny thing is that Sun tried using TCP (rather than UDP)
for NFS at first, but it was too slow, so they switched to UDP.  Now,
as part of cleaning up all this mess, they are adding practically all
of the capabilities of TCP to RPC/UDP.  Meanwhile, Jacobson
demonstrated that TCP isn't slow by nature, just by implementation.

< Question: What's the best way to fix the reordering problem?

Rewrite NFS from scratch, including (_especially_) the entire protocol.
(Only about 1/4 :-) -- it _really_ needs to be done).

< My personal suggestion is to make "significant" operations such
< as create act synchronously, so that the creating process
< cannot issue a subsequent write request before the create has
< definitely occurred.

This is about the same thing that many simpler communications
protocols do (like, say, Kermit): open, close, create, etc. are
synchronous, even if the actual data transmission is async (or
windowed).

Actually, it's kinda odd that, given a synchronous RPC system, these
operations were made async anyway.  Of course performance is slower
when creates are synchronous, but which do you want: fast performance,
or a reliable system?
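
A rough sketch of the difference, again in Python and again with
invented names (Server, send_batch, the "ENOENT" reply).  It assumes
that requests in flight together can arrive in any order, modelled
here as worst-case reversal, and that writes carry their own offsets
so they can be reordered among themselves without harm.

```python
class Server:
    """Toy server: a write to a file that does not exist yet fails."""

    def __init__(self):
        self.files = {}

    def handle(self, op, name, offset=0, data=b""):
        if op == "create":
            self.files[name] = bytearray()
            return "OK"
        if op == "write":
            if name not in self.files:
                return "ENOENT"
            buf = self.files[name]
            end = offset + len(data)
            if len(buf) < end:
                buf.extend(b"\0" * (end - len(buf)))  # grow file to fit
            buf[offset:end] = data
            return "OK"
        raise ValueError(op)


def send_batch(server, requests):
    """Deliver requests that are in flight together, in reversed order."""
    return {req: server.handle(*req) for req in reversed(requests)}


create = ("create", "f")
writes = [("write", "f", 0, b"he"), ("write", "f", 2, b"y!")]

# Everything async: the create is in flight along with the writes,
# so the writes can reach the server first.
async_server = Server()
async_replies = send_batch(async_server, [create] + writes)

# Synchronous create: wait for its reply, then window the writes.
sync_server = Server()
create_reply = send_batch(sync_server, [create])
sync_replies = send_batch(sync_server, writes)

print(async_replies[writes[0]], bytes(sync_server.files["f"]))
```

In the async case both writes overtake the create and are rejected.
With the create made synchronous, the writes can still be windowed
and reordered, and the file ends up intact.  The cost is one extra
round trip per create.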

What we really need is for some capable group to go off and write a
networked file sharing system that combines the best features of NFS
(error and failure recovery), (AT&T's) RFS (_full_ Unix semantics,
incl. remote /dev, when talking to another Unix machine), and Andrew
(caching remote mounted files to a more local machine), and make it as
available as Sun has made NFS.  Being able to run a pair of servers
as a redundant, reliable, read/write file system would be a nice
bonus.

		-- Paul Placeway <pplaceway@bbn.com>
		   (speaking for myself)