Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!ucsd!tut.cis.ohio-state.edu!brutus.cs.uiuc.edu!wuarchive!wugate!uunet!mcsun!ukc!icdoc!qmc-cs!liam
From: liam@cs.qmc.ac.uk (William Roberts)
Newsgroups: comp.protocols.nfs
Subject: Re: mountd Performance under Stress
Summary: Race condition + rmtab considered harmful
Keywords: mountd nfs performance
Message-ID: <1199@sequent.cs.qmc.ac.uk>
Date: 31 Aug 89 19:16:59 GMT
References: <1577@dsacg3.UUCP> <34283@apple.Apple.COM>
Reply-To: liam@cs.qmc.ac.uk (William Roberts)
Organization: Computer Science Dept, Queen Mary College, University of London, UK.
Lines: 73

>>On occasion we can have quite a few users issuing multiple mount
>>requests simultaneously. When this happens we see some of the requests time
>>out, while users accessing already mounted files continue to receive good
>>service.

This is a difference between user-level RPC and kernel-level RPC.
The kernel level *knows* that its NFS RPC requests are
idempotent, so it doesn't change the xid when it sends a
retransmission. This means that the first reply is acceptable
no matter how many retransmissions have occurred.

The user-level makes no such guarantee, so there is a new xid
for each retransmission. In particular, this means that each of
the mount program's RPC requests to the mount daemon *has* to be
answered before its timeout period is up, otherwise the reply
is discarded as out of date. Ultimately this becomes a race
condition, especially as the mount requests are small and the
machine can buffer lots of them. We had an NFS server with 40
clients that was a 0.5 MIP Whitechapel MG1 - when all 40
clients rebooted after a power failure it was taking about 3
minutes from a client sending a request to the mountd sending
the reply, by which time there were a lot of 25 second timeouts
gone by. Funny thing is, every mountd response is identical, so
the first one would do and the rest can be discarded....
You are just lucky that your server occasionally gets in there
quick enough!
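
To make the race concrete, here is a toy model in Python. It is only
a sketch under loud assumptions: every reply takes the same fixed
time to come back, the client retransmits on a fixed timer, and the
retry count of 10 is invented for illustration. The names (simulate,
server_delay and so on) are mine, not anything in the RPC library.

```python
def simulate(server_delay, timeout, max_tries, reuse_xid):
    """Return the time at which the client accepts a reply, or None.

    Toy model (all assumptions): the client sends at t = 0, timeout,
    2*timeout, ...; the reply to each send arrives server_delay
    seconds after that send and carries the xid of that send.
    """
    sends = []  # list of (send_time, xid)
    xid = 1
    for attempt in range(max_tries):
        if attempt > 0 and not reuse_xid:
            xid += 1  # user-level behaviour: new xid per retransmission
        sends.append((attempt * timeout, xid))

    give_up = max_tries * timeout
    for send_time, reply_xid in sends:
        arrival = send_time + server_delay
        if arrival >= give_up:
            break  # the client has stopped listening by now
        # Which xid is the client waiting for when this reply arrives?
        current_attempt = int(arrival // timeout)
        current_xid = sends[current_attempt][1]
        if reply_xid == current_xid:
            return arrival
    return None


# 3 minutes of server delay against 25 second timeouts, as on the MG1.
same_xid = simulate(180, 25, 10, reuse_xid=True)    # 180
fresh_xid = simulate(180, 25, 10, reuse_xid=False)  # None
```

With a fresh xid per retransmission, every reply that comes back is
already out of date, so the client never accepts one. With the xid
held constant, the first reply to arrive does the job.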

>>The mount server has to read /etc/exports, and to do the host name to IP
>>address translation would also have to access /etc/hosts (or the name
>>server), and
>>              ***it writes /etc/rmtab***      [ my emphasis ]
>>. So we thought mountd might be having
>>trouble getting to /etc. But ps "snapshots" showed mountd rarely waiting
>>on disk.

To be more specific, it does a linear scan through rmtab
looking to see if this mount request is already there and
adds onto the end if it isn't.

On my main machine /etc/rmtab is 978 lines long.

The reason it is so long is that most clients unmount their
disks by crashing, so the rmtab file never gets cleared by
unmount requests. On our MG1 servers we reniced the mountd to
-15 and removed all the /etc/rmtab nonsense.
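
The linear scan can be sketched as follows, in Python. This is an
illustration, not the mountd source: it assumes one "host:directory"
entry per line, keeps the file contents in a list instead of reading
the disk, and record_mount is a made-up name.

```python
def record_mount(rmtab_lines, host, directory):
    """Scan for an entry, append it if missing, return comparisons made."""
    entry = f"{host}:{directory}"
    comparisons = 0
    for line in rmtab_lines:
        comparisons += 1
        if line == entry:
            return comparisons  # already recorded, nothing to add
    rmtab_lines.append(entry)   # not found: add onto the end
    return comparisons


# 978 stale entries left behind by clients that crashed (hypothetical names).
rmtab = [f"client{i}:/export" for i in range(978)]
cost = record_mount(rmtab, "newhost", "/export")
```

Every mount request from a host that is not yet listed walks the
whole file, so the cost of each request grows with the number of
stale entries.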

I'm sorry Chuq, but all that stuff about relentless mashing of
mbufs just doesn't sound at all plausible, especially since the
lucky clients who have already mounted are getting good service.

(If it hadn't been from someone who ought to know I would have
 loudly decried it as complete *@*!%*, but perhaps I'm not so
 certain of my ground...)


The Bottom Line:

1) Change mount to use a TCP connection to the mountd, or
   otherwise provide an idempotent RPC
2) Change mountd to use a dbm file or some other means
   of speeding up the search through rmtab.
3) Encourage people to remove rmtab as part of the boot sequence!
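
As a rough idea of what point 2 might look like, here is a sketch
using Python's standard dbm module as a stand-in for a dbm file. The
key layout ("host:directory"), the file name rmtab_index and the
function name record_mount_dbm are all assumptions made for the
example, not a description of any real mountd.

```python
import dbm
import os
import tempfile


def record_mount_dbm(db, host, directory):
    """Record a mount in a keyed database; return True if it was new."""
    key = f"{host}:{directory}".encode()
    if key in db:      # keyed lookup instead of scanning every entry
        return False
    db[key] = b"1"     # the value is unused; only the key matters here
    return True


# Use a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    with dbm.open(os.path.join(tmp, "rmtab_index"), "c") as db:
        first = record_mount_dbm(db, "client1", "/export")
        second = record_mount_dbm(db, "client1", "/export")
```

The point of the design is only that checking for an existing entry
no longer depends on how many stale entries have piled up.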


Actually, idempotent RPC is an easy and valuable thing to do,
especially as you just say "Buyer beware" and treat "idempotent
RPC" to mean "don't increment the xid for each retransmission".
-- 

William Roberts         ARPA: liam@cs.qmc.ac.uk
Queen Mary College      UUCP: liam@qmc-cs.UUCP    AppleLink: UK0087
190 Mile End Road       Tel:  01-975 5250
LONDON, E1 4NS, UK      Fax:  01-980 6533