Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!bbn!news
From: news@bbn.COM (News system owner ID)
Newsgroups: comp.protocols.nfs
Subject: Re: NFS not idempotent (was re: mountd Performance under Stress)
Message-ID: <45595@bbn.COM>
Date: 14 Sep 89 16:34:20 GMT
References: <14068@bloom-beacon.MIT.EDU> <1211@sequent.cs.qmc.ac.uk>
Reply-To: pplaceway@izar.bbn.com (Paul W. Placeway)
Organization: Bolt Beranek and Newman Inc., Cambridge MA
Lines: 63

liam@cs.qmc.ac.uk (William Roberts) writes:
< jtkohl@athena.mit.edu (John T Kohl) writes:
< > liam@cs.qmc.ac.uk (William Roberts) writes:
< >
< > This is a difference between user-level RCP and kernel-level RPC.
< > The kernel level *knows* that its NFS RPC requests are
< > idempotent
< >
< >Unfortunately, some of them ARE NOT idempotent, and that has caused
< >great troubles to us at MIT.

As near as I can tell, claiming that NFS ops are idempotent is nothing
more than "marketing blurf" from Sun marketing. The problem is that
Sun didn't think quite hard enough about what a remote file system has
to do before going off and writing one, and now we are stuck with it.

< The server implementations around keep a cache of recent
< requests so that they can recite the previous reply if this
< happens to be a retransmission.

This is what is known generally as "a hack" (a workaround, if you
prefer). The trouble is, as W.R. points out, it doesn't really work
all the time. It makes the situation better, but it doesn't really
cure it.

The really funny thing is that Sun tried using TCP (rather than UDP)
for NFS at first, but it was too slow, so they switched to UDP. Now,
as part of cleaning up all this mess, they are adding practically all
of the capabilities of TCP to RPC/UDP. Meanwhile, Jacobson
demonstrated that TCP isn't slow by nature, just by implementation.

< Question: What's the best way to fix the reordering problem?

Rewrite NFS from scratch, including (_especially_) the entire protocol.
(Only about 1/4 :-) -- it _really_ needs to be done).

< My personal suggestion is to make "significant" operations such
< as create act synchronously, so that the creating process
< cannot issue a subsequent write request before the create has
< definitely occurred.

This is about the same thing that many simpler communications
protocols do (like, say, Kermit): open, close, create, etc. are
synchronous, even if the actual data transmission is async (or
windowed). Actually, it's kinda odd that, given a synchronous RPC
system, these operations were made async anyway.

Of course performance is slower when creates are synchronous, but
which do you want: fast performance, or a reliable system?

What we really need is for some capable group to go off and write a
networked file sharing system that combines the best features of NFS
(error and failure recovery), (AT&T's) RFS (_full_ Unix semantics,
incl. remote /dev, when talking to another Unix machine), and Andrew
(caching remote mounted files to a more local machine), and make it as
available as Sun has made NFS.

Being able to run a pair of servers as a redundant, reliable,
read/write file system would be a nice bonus.

		-- Paul Placeway
		   (speaking for myself)
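
P.S. To make the retransmission problem concrete, here is a minimal
sketch in Python of a toy server with a non-idempotent remove
operation and a small cache of recent replies. Everything in it is a
hypothetical illustration rather than real NFS server code: the
ToyServer class, the (client, xid) cache key, the string replies, and
the cache_size parameter are all assumptions made for the example.

```python
from collections import OrderedDict


class ToyServer:
    """Toy file server (hypothetical; not modelled on any real NFS code)."""

    def __init__(self, cache_size=0):
        self.files = {"a.txt"}
        self.cache_size = cache_size      # 0 means no reply cache at all
        self.replies = OrderedDict()      # (client, xid) -> earlier reply

    def remove(self, client, xid, name):
        key = (client, xid)
        if key in self.replies:
            # Looks like a retransmission: recite the previous reply.
            return self.replies[key]
        if name in self.files:
            self.files.remove(name)
            reply = "OK"
        else:
            reply = "ENOENT"
        if self.cache_size > 0:
            self.replies[key] = reply
            if len(self.replies) > self.cache_size:
                self.replies.popitem(last=False)   # evict the oldest entry
        return reply


# No cache: the first reply is lost, the client resends the same request,
# and the second execution reports an error for an operation that worked.
plain = ToyServer()
first = plain.remove("c1", 1, "a.txt")
retry = plain.remove("c1", 1, "a.txt")

# With a cache, a prompt retransmission gets the original reply back...
cached = ToyServer(cache_size=1)
cached.remove("c1", 1, "a.txt")
replayed = cached.remove("c1", 1, "a.txt")

# ...but once other traffic has pushed the entry out, a late
# retransmission is executed again and fails.
cached.remove("c2", 7, "b.txt")
late = cached.remove("c1", 1, "a.txt")

print(first, retry, replayed, late)
```

The cache is bounded on purpose: that is what makes it a workaround
rather than a cure, since any retransmission that arrives after its
entry has been evicted is treated as a brand new request.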