Path: utzoo!attcan!uunet!lll-winken!ames!pasteur!ucbvax!tut.cis.ohio-state.edu!rutgers!att!ulysses!mhuxo!mhuxu!alux2!matthews
From: matthews@alux2.ATT.COM (John Matthews)
Newsgroups: comp.protocols.tcp-ip
Subject: NFS Performance through Routers
Message-ID: <237@alux2.ATT.COM>
Date: 18 Mar 89 21:30:56 GMT
Reply-To: ulysses!aloft!matthews@princeton.edu (John Matthews)
Organization: Laboratory 5223
Lines: 76

Last week we replaced a DEC Lan Bridge with a new Proteon P4200 router
to create a local subnet for our building.  Ever since, things have
been running extremely slowly for the people who get their CAD software
through our gateway.  They rely heavily on huge CAD executables that
get sent through the gateway.  I have been looking into this for quite
some time and I am finally posting a message here to see what other
people have done in similar situations.

What I have found is that default NFS mounts make reads and writes that
are 8192 bytes long.  Each request goes out as a single UDP datagram,
which the kernel's IP layer then fragments into up to 9 packets.  If
the Proteon discards even one of these fragments, all 9 of them have to
be retransmitted.  I went around and changed all of the NFS mounts to
do 1024-byte reads and writes.  This seemed to improve things a little.

Another thing that I have noticed is that we
are getting extremely high collision rates on the Suns.  They add up to
about a million and a half for the past week.  Someone told me that the
Suns don't abide by a standard that says they should wait 10
milliseconds between each packet they send to give others a chance to
transmit.  I was told they wait only 1 millisecond, if that.  Could
this be causing a lot of collisions?  There are only around 15 Sun
clients and one server sitting on each of two bridged ethernets in the
building where they are having all of the problems.  In the main
building we have things set up the same way with collisions adding up
to around 40,000.

There does seem to be a lot of broadcasting going on on that ethernet
that could cause this.  There is a problem stemming from
the fact that older versions of UNIX are trying to forward IP broadcast
packets.  When these hosts receive a broadcast RIP packet addressed
to 128.94.255.255, they think it's a packet destined for a specific
machine and they then try to forward it.  For every such packet, an
ARP request is broadcast on the ethernet.  There are about 16
machines running the old network software and 5 routers generating up
to 5 RIP packets every 30 seconds.  I believe that adds up to around
28,000 broadcasts per hour.  Temporarily, I answered these ARP requests
and pointed them to a device that would ignore them, but the network is
still slow.  Is there anything wrong with responding to these ARP
requests with an ethernet address that doesn't really exist on that
network?  Then the machines running the old network software would just
forward the packets into a black hole.  Am I thinking right or would
this cause problems?  What will the DEC Lan Bridges do with an ethernet
packet when they have no idea which side that ethernet device is really
on?  Will every bridge throughout the network pass this packet every
time it's sent?
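
As a rough sketch of the fragmentation arithmetic described above, the
following Python estimates how many IP fragments one NFS request turns
into and how fragment loss is amplified into whole-request loss.  The
128-byte RPC/NFS header overhead and the 1% loss figure are assumptions
for illustration, not measurements from this network.

```python
import math

IP_HEADER = 20       # bytes, assuming no IP options
UDP_HEADER = 8       # bytes
RPC_OVERHEAD = 128   # bytes of RPC/NFS headers per request (rough guess)

def fragments_per_request(nfs_size, mtu=1500):
    """IP fragments needed to carry one NFS read/write of nfs_size bytes."""
    datagram = nfs_size + RPC_OVERHEAD + UDP_HEADER
    # Fragment offsets are counted in 8-byte units, so the data carried by
    # each fragment is rounded down to a multiple of 8.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(datagram / per_fragment)

def request_loss_probability(nfs_size, fragment_loss, mtu=1500):
    """Chance that at least one fragment of a request is dropped, which
    forces the whole request to be retransmitted."""
    n = fragments_per_request(nfs_size, mtu)
    return 1.0 - (1.0 - fragment_loss) ** n

for size in (8192, 1024):
    print(size, fragments_per_request(size),
          round(request_loss_probability(size, 0.01), 4))
```

Under these assumptions a plain 1500-byte ethernet MTU gives 6
fragments for an 8192-byte request; the count only reaches 9 if the
fragments are around 1 KB each.  Either way, a 1024-byte request fits
in a single packet, so one dropped packet costs one packet rather than
the whole 8K transfer.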

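The broadcast estimate above can be checked with a few lines of
Python.  This is only a sketch: it assumes every old host sends exactly
one ARP request for each RIP broadcast it sees, and the function and
parameter names are made up for illustration.

```python
def arp_broadcasts_per_hour(old_hosts, routers, rip_packets_per_update,
                            update_interval_s=30):
    """ARP broadcasts per hour if each old host ARPs once per RIP packet."""
    updates_per_hour = 3600 // update_interval_s
    rip_per_hour = routers * rip_packets_per_update * updates_per_hour
    return old_hosts * rip_per_hour

# 16 old hosts and 5 routers, with 1 to 5 RIP packets per 30-second update
for n in range(1, 6):
    print(n, arp_broadcasts_per_hour(16, 5, n))
```

With these assumptions, a figure near 28,000 per hour corresponds to an
average of about 3 RIP packets per router per update, and the worst
case of 5 packets per update would be 48,000 per hour.
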
Last night we tried to configure an extra ethernet board on the
fileserver that houses all of the CAD software and connect it to the
other ethernet cable to give them back some speed.  All we did was
uncomment the ie1 interface in the kernel config file, recompile the
kernel and reboot.  We didn't change any of the /etc/*rc* files at
all.  When the Sun came back up, all of the old NFS mounts on the
clients just timed out.  The NFS daemons wouldn't service any NFS
requests.  I was able to use telnet and rlogin to connect to hosts on
either side after manually ifconfig'ing the new ie1 interface.  I gave
up and tried it on another fileserver.  It did the exact same thing.
The thing that doesn't make sense is that the only thing we did was add
one ethernet device to the kernel and then nothing worked the way it
used to.  We rebooted on the old kernels and everything was back to
normal.  We called Sun but they didn't seem to know what the problem
was.  Has anyone else ever encountered such problems?

We are eventually going to move some of that software to servers in
that building so that they aren't pounding on the gateway.  I wasn't
aware that the Proteon had so little bandwidth compared to a LAN
bridge.  How on earth can they go from ProNET-80 to ethernet when they
can't come close to handling ethernet's full 10 megabits/s?  What
percentage of 10 megabits/s can a Proteon really route from one
ethernet to another?  Has anyone done some real-life performance
testing?
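
For anyone comparing measurements, here is a small Python sketch of
what "full 10 megabits/s" means in frames per second, since routers are
usually limited by packets per second rather than raw bits.  It uses
only the standard ethernet framing overhead (8-byte preamble and a
9.6 microsecond interframe gap, equal to 12 byte times); it says
nothing about what a Proteon can actually forward.

```python
LINE_RATE_BPS = 10_000_000   # 10 Mbit/s ethernet
PREAMBLE_BYTES = 8           # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12    # 9.6 microseconds at 10 Mbit/s

def max_frames_per_second(frame_bytes):
    """Theoretical frame rate on a saturated 10 Mbit/s ethernet.
    frame_bytes counts the frame from destination address through CRC."""
    bits_on_wire = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return LINE_RATE_BPS / bits_on_wire

for size in (64, 1518):
    print(size, int(max_frames_per_second(size)))
```

A measured forwarding rate for a given frame size can be divided by the
matching figure here to get the percentage of wire speed.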

What other things could I do to optimize NFS traffic?

If there are things that I am wrong about, please let me know.  This
has been a frustrating week to say the least.  If anyone could spare a
few minutes on the phone, please e-mail me your phone number.  I'd
really appreciate it.
				John Matthews
				ulysses!aloft!matthews@princeton.edu
				matthews@aloft.att.com
				matthews@research.att.com