Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!tut.cis.ohio-state.edu!ucbvax!ICAEN.UIOWA.EDU!dbfunk
From: dbfunk@ICAEN.UIOWA.EDU (David B Funk)
Newsgroups: comp.sys.apollo
Subject: Re: tcpd problems
Message-ID: <9105231001.AA01218@icaen.uiowa.edu>
Date: 23 May 91 08:43:31 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: Iowa Computer Aided Engineering Network, University of Iowa
Lines: 74

In an earlier posting, Ian Barkley writes:
>Hello all. We have been having a problem recently with our tcpd. Recently, the
>connections to the outside world have been having a bad habit of dying or
>freezing for several minutes at a time. When I looked at the process list of
> [stuff deleted]
>One time I tried running tcpd's nice number up to -18, so that it couldn't
>get a priority less than 3 (that of most of the other processes on typhoon).
>It worked fine for a while, then BLAM, the priority went to 3 and tcpd started
>eating 100% of the CPU time; the disk was accessable (slowly), but I couldn't
>get on and finally had to white-button the machine.
>
>These freezes seem to happen mostly under heavy loads. They last from a few
>seconds to several minutes. Even when tcpd is working, it's PRI tends to
>bounce between 1 and 3, even with a nice of -15.

Yes, we have seen this exact same problem. It is caused by packet buffer
exhaustion (those little buffers work real hard ;). The clue is in the
output of "netstat -m" (or the memory part of the -T output). Referring
back to the output listing from your posting:

> 335/336 mbufs in use:
>       111/112 (  80-byte) mbufs used/allocated
>       112/112 (1560-byte) mbufs used/allocated
>       112/112 (9216-byte) mbufs used/allocated
> 1187 Kbytes allocated to network (99% in use)
> 1792761 requests for memory denied

Note the "99% in use" statistic. All the packet buffers are being used by
somebody, so when a packet comes in from the net it gets dropped on the
floor, as there is no place to put it.
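If you want to watch for this automatically, here is a rough sketch (in
Python, which is my choice and not anything that ships with the node) that
pulls the interesting numbers out of a "netstat -m" listing. It assumes
the output looks like the listing quoted above; the function names are
made up for illustration, and the patterns may need adjusting for other
netstat versions.

```python
import re

# The "netstat -m" listing quoted in the post, used as sample input.
MBUF_SAMPLE = """\
335/336 mbufs in use:
        111/112 (  80-byte) mbufs used/allocated
        112/112 (1560-byte) mbufs used/allocated
        112/112 (9216-byte) mbufs used/allocated
1187 Kbytes allocated to network (99% in use)
1792761 requests for memory denied
"""


def parse_mbuf_report(text):
    """Extract totals, per-size pools, percent in use, and denied requests.

    Hypothetical helper: assumes the line formats of the listing above.
    """
    report = {"pools": {}}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"(\d+)/(\d+) mbufs in use", line)
        if m:
            report["used"] = int(m.group(1))
            report["allocated"] = int(m.group(2))
            continue
        m = re.match(r"(\d+)/(\d+) \(\s*(\d+)-byte\) mbufs used/allocated", line)
        if m:
            # Key each pool by buffer size: size -> (used, allocated)
            report["pools"][int(m.group(3))] = (int(m.group(1)), int(m.group(2)))
            continue
        m = re.match(r"\d+ Kbytes allocated to network \((\d+)% in use\)", line)
        if m:
            report["percent_in_use"] = int(m.group(1))
            continue
        m = re.match(r"(\d+) requests for memory denied", line)
        if m:
            report["denied"] = int(m.group(1))
    return report


def exhausted_pools(report):
    """Return the buffer sizes whose pools have no free buffers left."""
    return sorted(size for size, (used, alloc) in report["pools"].items()
                  if used >= alloc)


if __name__ == "__main__":
    r = parse_mbuf_report(MBUF_SAMPLE)
    print("percent in use:", r["percent_in_use"])
    print("requests denied:", r["denied"])
    print("exhausted pool sizes:", exhausted_pools(r))
```

Running it periodically and comparing the "denied" count between runs
would show when the counter starts climbing.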
When you reach this condition, the value of the statistic "requests for
memory denied" will start going up like crazy. The tcp daemon (/etc/tcpd)
will also go off into a routine that tries to expand the memory buffer
pool. (If you do a "tb" traceback on the running tcpd, you can see this;
it should be in "expand_mbufpool".) For some reason this routine goes into
a spin loop and eats up lots of CPU time without doing much. This is the
cause of your node lockup when you jacked up the tcpd priority. (Don't
bother raising the priority; the routine never seems to do anything but
eat up CPU time.)

The only cure that I've found for this condition is to find the offending
processes that are holding the buffers and kill them. Once the processes
die, their buffers are released and things usually get back to normal.
(In severe cases, reboot, or kill all tcp-related stuff and restart tcpd
from scratch.)

The "active connections list" output from netstat can provide a clue as
to who the real culprit is. Look for connections that have a large Recv-Q
or Send-Q value. (The values seem to saturate at 9100; above 7500 is
possible trouble.) Then comes the fun of trying to match the process to
the netstat connection.

Some of the problem processes that I have seen are things like:

  - A download via a program such as "sz", when the process at the far
    end dies or becomes unreachable due to network problems. (Causes a
    large Send-Q.)

  - A telnet session where the user is running a program producing lots
    of output and then the remote user gives his telnet program a stop
    fault (^Z). (Large Send-Q.)

  - Incoming ftp data, where the disk is full. (Large Recv-Q.)

If it is just occasional peaks in the network load, you may just have to
live with it (or reduce the process load on your gateway). It is probably
being caused by a process executing on that node.
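The Recv-Q/Send-Q check described above can be roughed out as a small
filter. This is only a sketch: it assumes a BSD-style connection listing
(Proto, Recv-Q, Send-Q, Local Address, Foreign Address, state), the sample
hosts and queue values are invented, and the function name is made up.
Only the 7500 threshold comes from the observations above. It does not
solve the hard part, which is matching the connection to a process.

```python
QUEUE_WARN = 7500  # from the post: "above 7500 is possible trouble"

# Hypothetical sample in an assumed BSD-style layout; hosts and numbers
# are invented for illustration, not taken from a real node.
CONN_SAMPLE = """\
Active connections
Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)
tcp        0   9100  gateway.telnet     hosta.1042         ESTABLISHED
tcp     8192      0  gateway.ftp-data   hostb.1187         ESTABLISHED
tcp        0      0  gateway.smtp       hostc.2210         TIME_WAIT
"""


def suspect_connections(text, threshold=QUEUE_WARN):
    """Return (local, foreign, recv_q, send_q) for tcp connections whose
    Recv-Q or Send-Q exceeds the threshold."""
    suspects = []
    for line in text.splitlines():
        fields = line.split()
        # Skip titles, column headers, and non-tcp lines.
        if len(fields) < 5 or not fields[0].startswith("tcp"):
            continue
        try:
            recv_q, send_q = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # queue columns were not numeric; ignore the line
        if recv_q > threshold or send_q > threshold:
            suspects.append((fields[3], fields[4], recv_q, send_q))
    return suspects


if __name__ == "__main__":
    for local, foreign, recv_q, send_q in suspect_connections(CONN_SAMPLE):
        print(f"{local} <-> {foreign}  Recv-Q={recv_q}  Send-Q={send_q}")
```

A large Send-Q points at a stalled receiver on the far end; a large
Recv-Q points at a local process that is not reading its data.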
Pure gateway operations (i.e., traffic coming from one remote node and
going to some other node) don't seem to eat up lots of buffers. All the
cases that I've seen were due to some user's program executing on the
afflicted node.

We have one machine that is our primary gateway/mail-server/name-server,
which handles lots of traffic but doesn't have any users on it. This
machine handles millions of packets a day with no "requests for memory
denied" errors. We have another machine that runs a popular "MUD" game,
and it can generate a million "requests for memory denied" errors in a
few hours when it has a heavy load (30+ active users).

Good luck.
Dave Funk