Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!tut.cis.ohio-state.edu!ucbvax!ICAEN.UIOWA.EDU!dbfunk
From: dbfunk@ICAEN.UIOWA.EDU (David B Funk)
Newsgroups: comp.sys.apollo
Subject: Re: tcpd problems
Message-ID: <9105231001.AA01218@icaen.uiowa.edu>
Date: 23 May 91 08:43:31 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: Iowa Computer Aided Engineering Network, University of Iowa
Lines: 74

In posting <IANB.91May22203326@maelstrom.ocf.Berkeley.EDU>, Ian Barkley writes:

>Hello all. We have been having a problem recently with our tcpd. Recently, the 
>connections to the outside world have been having a bad habit of dying or 
>freezing for several minutes at a time. When I looked at the process list of
>
[stuff deleted]
>One time I tried running tcpd's nice number up to -18, so that it couldn't
>get a priority less than 3 (that of most of the other processes on typhoon).
>It worked fine for a while, then BLAM, the priority went to 3 and tcpd started
>eating 100% of the CPU time; the disk was accessable (slowly), but I couldn't
>get on and finally had to white-button the machine. 
>
>These freezes seem to happen mostly under heavy loads. They last from a few
>seconds to several minutes. Even when tcpd is working, it's PRI tends to
>bounce between 1 and 3, even with a nice of -15. 

Yes, we have seen this exact same problem. It is caused by packet buffer exhaustion
(those little buffers work real hard ;).

The clue is the output from a "netstat -m" (or the memory part of the -T output).
Referring back to the output listing from your posting:

> 335/336 mbufs in use:
>         111/112 (  80-byte) mbufs used/allocated
>         112/112 (1560-byte) mbufs used/allocated
>         112/112 (9216-byte) mbufs used/allocated
> 1187 Kbytes allocated to network (99% in use)
> 1792761 requests for memory denied

Note the "99% in use" statistic. All the packet buffers are being used by somebody,
so when a packet comes in from the net, it gets dropped on the floor as there is
no place to put it. When you reach this condition, the value of the
statistic "requests for memory denied" will start going up like crazy, and
the tcp daemon (/etc/tcpd) will go off into some
routine that tries to expand the memory buffer pool. (If you do a "tb" trace back
on the running tcpd, you can see this; it should be in "expand_mbufpool".)
For some reason this routine goes into a spin loop & eats up lots of CPU time
without doing much. This is the cause of your node lock up when you jacked up the
tcpd priority. (Don't bother to jack the priority; the routine
never seems to do anything but eat up CPU time.)
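
If you want to watch for this condition without eyeballing the listing, something
like the following rough Python sketch could do it. It only parses text in the
layout quoted above; the function names and the 95% limit are my own assumptions,
and how you capture the "netstat -m" output on your node is left out.

```python
import re

# The "netstat -m" listing quoted in the posting, without the "> " quote marks.
SAMPLE = """\
335/336 mbufs in use:
        111/112 (  80-byte) mbufs used/allocated
        112/112 (1560-byte) mbufs used/allocated
        112/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (99% in use)
1792761 requests for memory denied
"""


def parse_mbuf_stats(text):
    """Pull the summary numbers out of a "netstat -m" style listing."""
    stats = {}
    m = re.search(r"(\d+)/(\d+) mbufs in use", text)
    if m:
        stats["used"], stats["allocated"] = int(m.group(1)), int(m.group(2))
    m = re.search(r"\((\d+)% in use\)", text)
    if m:
        stats["percent_in_use"] = int(m.group(1))
    m = re.search(r"(\d+) requests for memory denied", text)
    if m:
        stats["denied"] = int(m.group(1))
    return stats


def is_exhausted(before, after, percent_limit=95):
    """True if the pool is nearly full AND the denied counter rose between samples.

    percent_limit is an arbitrary illustrative threshold, not a documented value.
    """
    nearly_full = after.get("percent_in_use", 0) >= percent_limit
    denied_rising = after.get("denied", 0) > before.get("denied", 0)
    return nearly_full and denied_rising


if __name__ == "__main__":
    print(parse_mbuf_stats(SAMPLE))
```

Take two samples a minute or so apart and hand both to is_exhausted(); a full pool
by itself is not the problem, it is the climbing "denied" count that gives it away.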

The only cure that I've found for this condition is to find the offending processes
that are holding the buffers & kill them. Once the processes die, their buffers are
released & things usually get back to normal. (In severe cases, reboot or kill all
tcp-related stuff & restart tcpd from scratch.)

The "active connections list" output from netstat can provide a clue as to who the
real culprit is. Look for connections that have a large Recv-Q or Send-Q value.
(The values seem to saturate at 9100; anything above 7500 is possible trouble.)
Then comes the fun of trying to match the process to the netstat connection.
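
Picking out the suspect connections can be scripted too. The sketch below assumes
the usual BSD-style column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign
Address, state); I have not checked it against every netstat variant, and the
host names in the sample are made up for illustration. The 7500 threshold is the
rule of thumb given above.

```python
# Hypothetical listing in BSD-style layout; hosts and ports are invented.
SAMPLE = """\
Active Internet connections
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp        0   9100  typhoon.1023           farhost.login          ESTABLISHED
tcp     8200      0  typhoon.ftp-data       otherhost.1742         ESTABLISHED
tcp        0     12  typhoon.telnet         somehost.2210          ESTABLISHED
"""


def suspect_connections(text, threshold=7500):
    """Return (recv_q, send_q, local, foreign) for tcp lines with a big queue."""
    suspects = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the banner, the column header, and anything that is not tcp.
        if len(fields) < 5 or not fields[0].startswith("tcp"):
            continue
        try:
            recv_q, send_q = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # queue columns were not numeric; layout differs
        if max(recv_q, send_q) > threshold:
            suspects.append((recv_q, send_q, fields[3], fields[4]))
    return suspects


if __name__ == "__main__":
    for recv_q, send_q, local, foreign in suspect_connections(SAMPLE):
        print(f"Recv-Q={recv_q} Send-Q={send_q} {local} <-> {foreign}")
```

This only narrows down which connections to look at; matching a connection to the
process that owns it still has to be done by hand.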

Some of the problem processes that I have seen are things like:
    A download via a program such as "sz", when the process at the far end
    dies or becomes unreachable due to network problems (causes a large Send-Q).

    A telnet session where the user is running a program producing lots of output
    and then the remote user gives his telnet program a stop fault (^Z)
    (large Send-Q).

    Incoming ftp data, where the disk is full (large Recv-Q).

If it is just occasional peaks in the network load, you may just have to live
with it (or reduce the process load on your gateway). The buffer exhaustion is probably
being caused by a process executing on that node. Pure gateway operations (i.e. traffic
coming from one remote node and going to some other node) don't seem to eat up lots of buffers.
All the cases that I've seen were due to some user's program executing on the afflicted
node. We have one machine that is our primary gateway/mail-server/name-server, which
handles lots of traffic, but it doesn't have any users on it. This machine handles
millions of packets a day with no "requests for memory denied" errors. We have another
machine that has a popular "MUD" game which can generate a million "requests for memory
denied" errors in a few hours, when it has a heavy load (30+ active users).

Good luck.
Dave Funk