Path: utzoo!attcan!uunet!tut.cis.ohio-state.edu!cis.ohio-state.edu!karl_kleinpaste
From: karl_kleinpaste@cis.ohio-state.edu
Newsgroups: comp.sys.pyramid
Subject: help with mbuf leak problem?
Message-ID: <KARL.90Sep15103341@giza.cis.ohio-state.edu>
Date: 15 Sep 90 14:33:41 GMT
Sender: news@tut.cis.ohio-state.edu
Organization: Ohio State Computer Science
Lines: 84

Pyramid 98xe, OSx4.4c, nfsd 8, biod 8.

One of my Pyrs has developed a serious problem with mbuf lossage in
the last 16 hours or so.  Here's the netstat -m output from just
before his last reboot, about 10 minutes ago:

	2003/2032 mbufs in use:
	        1877 mbufs allocated to data
	        12 mbufs allocated to packet headers
	        109 mbufs allocated to routing table entries
	        3 mbufs allocated to socket names and addresses
	        2 mbufs allocated to interface addresses
	128/128 mapped pages in use
	510 Kbytes allocated to network (99% in use)
	1 requests for memory denied

Note the excessive data mbuf allocation (1877 of the 2003 in use) and
the 99% utilization.  Compare the same output from his twin, in the
next cabinet, which looks quite normal and has been running for days:

	86/288 mbufs in use:
	        3 mbufs allocated to data
	        4 mbufs allocated to packet headers
	        75 mbufs allocated to routing table entries
	        2 mbufs allocated to socket names and addresses
	        2 mbufs allocated to interface addresses
	28/96 mapped pages in use
	228 Kbytes allocated to network (29% in use)
	0 requests for memory denied
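
For anyone who wants to put numbers on that comparison, here is a
rough sketch in Python.  It assumes the netstat -m layout is exactly
as pasted above; parse_mbuf_stats and data_share are hypothetical
helper names of my own, not anything shipped with OSx.

```python
import re

# netstat -m samples copied from the two machines above.
VICTIM = """\
2003/2032 mbufs in use:
        1877 mbufs allocated to data
        12 mbufs allocated to packet headers
128/128 mapped pages in use
1 requests for memory denied
"""

TWIN = """\
86/288 mbufs in use:
        3 mbufs allocated to data
        4 mbufs allocated to packet headers
28/96 mapped pages in use
0 requests for memory denied
"""


def parse_mbuf_stats(text):
    """Pull the headline counters out of netstat -m output (layout assumed as above)."""
    stats = {}
    m = re.search(r"(\d+)/(\d+) mbufs in use", text)
    if m:
        stats["in_use"] = int(m.group(1))
        stats["total"] = int(m.group(2))
    m = re.search(r"(\d+) mbufs allocated to data", text)
    if m:
        stats["data"] = int(m.group(1))
    m = re.search(r"(\d+) requests for memory denied", text)
    if m:
        stats["denied"] = int(m.group(1))
    return stats


def data_share(stats):
    """Fraction of the in-use mbufs that are holding data."""
    return stats["data"] / stats["in_use"]


if __name__ == "__main__":
    for name, sample in (("victim", VICTIM), ("twin", TWIN)):
        s = parse_mbuf_stats(sample)
        print(name, s, "data share %.0f%%" % (100 * data_share(s)))
```

Run periodically against live netstat -m output, the same parse would
show how fast the data count ratchets up between reboots.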

This leakage started happening sometime around 5pm or 6pm last
evening.  I have had to reboot almost hourly just to keep the @#$%
machine alive.  I've experimented with several things, trying to find
the cause.  Killing off assorted network daemons didn't help;
sendmail, nntp, inetd as a whole, routed, pcnfsd were all killed, and
yet the data mbuf allocation keeps ratcheting upward.  I tried
rebooting with 16 nfsd/biod but this was no help either.  Killing off
all nfsd/biod and the portmapper didn't help.  Renicing nfsd and/or
biod didn't help.  As near as I can see, nothing running on the Pyr
itself is the cause of this.

"etherfind -r -n src victim-pyr or dst victim-pyr" run from a nearby
SunOS4.1 Sun3 shows a great deal of NFS traffic, of this form:

	UDP from another-pyr.1023 to victim-pyr.2049  128 bytes
	 RPC Call prog 200000 proc 1 V1 [93dc7]
	UDP from victim-pyr.2049 to another-pyr.1023  104 bytes
	 RPC Reply  [93dc7] AUTH_NULL Success
	UDP from another-pyr.1023 to victim-pyr.2049  172 bytes
	 RPC Call prog 200000 proc 9 V1 [93dc8]
	UDP from victim-pyr.2049 to another-pyr.1023  36 bytes
	 RPC Reply  [93dc8] AUTH_NULL Success
	UDP from another-pyr.1023 to victim-pyr.2049  172 bytes
	 RPC Call prog 200000 proc 9 V1 [93dc9]
	UDP from victim-pyr.2049 to another-pyr.1023  36 bytes
	 RPC Reply  [93dc9] AUTH_NULL Success
	UDP from another-pyr.1023 to victim-pyr.2049  128 bytes
	 RPC Call prog 200000 proc 1 V1 [93dca]
	UDP from victim-pyr.2049 to another-pyr.1023  104 bytes
	 RPC Reply  [93dca] AUTH_NULL Success

But not all of this traffic is coming from another-pyr -- assorted
Pyrs, Suns, and the occasional HP show up.
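
To see which clients account for most of the calls, the trace can be
tallied by source host and RPC procedure number.  This is only a
sketch in Python; it assumes etherfind prints exactly the two-line
format shown above, and tally_calls is a hypothetical name of my own.

```python
import re
from collections import Counter

# "UDP from host.port to host.port  N bytes"; the port is whatever
# follows the last dot, so dotted numeric addresses also work.
UDP_LINE = re.compile(r"UDP from (\S+)\.(\d+) to (\S+)\.(\d+)\s+(\d+) bytes")
CALL_LINE = re.compile(r"RPC Call prog (\d+) proc (\d+)")

SAMPLE = """\
UDP from another-pyr.1023 to victim-pyr.2049  128 bytes
 RPC Call prog 200000 proc 1 V1 [93dc7]
UDP from victim-pyr.2049 to another-pyr.1023  104 bytes
 RPC Reply  [93dc7] AUTH_NULL Success
UDP from another-pyr.1023 to victim-pyr.2049  172 bytes
 RPC Call prog 200000 proc 9 V1 [93dc8]
UDP from victim-pyr.2049 to another-pyr.1023  36 bytes
 RPC Reply  [93dc8] AUTH_NULL Success
"""


def tally_calls(lines):
    """Count RPC calls per (source host, procedure number) pair."""
    counts = Counter()
    src = None
    for line in lines:
        m = UDP_LINE.search(line)
        if m:
            src = m.group(1)  # remember who sent this packet
            continue
        m = CALL_LINE.search(line)
        if m and src is not None:
            counts[(src, int(m.group(2)))] += 1
    return counts


if __name__ == "__main__":
    for (host, proc), n in tally_calls(SAMPLE.splitlines()).most_common():
        print("%-20s proc %-3d %d calls" % (host, proc, n))
```

Replies are ignored on purpose, since the question is who is sending
requests to the victim.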

I'm also getting messages like
	NFS server write failed: (err=13, dev=0xffa610a4, ino=0xffa69bd0).
on the console occasionally.  Errno 13 is EACCES.  ???

The only anomalous thing about this Pyr's configuration is that it's
the departmental /usr/spool/mail NFS server.  But that's been the case
for a couple of years now, nothing new or unusual about that.

As I said, I'm rebooting roughly hourly at this point to keep it
alive.  It seems to perform admirably right up until the end, when the
2032/2032 mbuf condition hits.  A reboot takes 10 minutes, and then
it is fine again for the next hour while the mbuf count climbs.

Clues, anyone?  I can't think of anything that would have been started
at 5pm on a Friday evening which might cause this sort of thing.  What
sort of activity on the Pyr or elsewhere on my network should I be
looking for?

--karl