Path: utzoo!attcan!uunet!tut.cis.ohio-state.edu!cis.ohio-state.edu!karl_kleinpaste From: karl_kleinpaste@cis.ohio-state.edu Newsgroups: comp.sys.pyramid Subject: Re: help with mbuf leak problem? Message-ID: Date: 18 Sep 90 15:10:52 GMT References: Sender: news@tut.cis.ohio-state.edu Organization: Ohio State Computer Science Lines: 95 On Saturday, I wrote in desperation: I've developed a nasty problem with one of my Pyrs in the last 16 hours or so. It has developed a serious problem with mbuf lossage. I received the following extremely helpful, detailed reply from Robert Ehrlich. The cause he describes correlates perfectly with our problem here; we recently upgraded several hundred Sun3/50s under SunOS3.5 to Sun SLCs under SunOS4.1, and the SLC causing the abuse (finally found with a combination of etherfind and shutting down Sun subnets in turn) belongs to a faculty member who is in 9 groups when the Pyrs only believe in 8. We are now seeing to it that all users are in no more than 8 groups. Is the problem Robert describes "fixed in 5.0?" The Pyr RTOC (or CCSC or whatever it's called now) folks have been pestering me for a while to upgrade (4.4c here still), but I've been putting it off due to a lack of excitement over the prospect. But if it'll take care of this problem for me, I'll probably see about upgrading Real Soon Now, as in, a couple of weekends from now. Many thanx to Robert. --karl ---------------- UNIX-From: ehrlich@inria.inria.fr Mon Sep 17 19:41:33 1990 Received: from inria.inria.fr by cheops.cis.ohio-state.edu (5.61-kk/5.900912) id AA03385; Mon, 17 Sep 90 19:41:33 -0400 Received: by inria.inria.fr (5.64+/89.0.8) via Fnet-EUnet id AA12608; Tue, 18 Sep 90 01:42:17 +0200 (MET) Date: Tue, 18 Sep 90 01:42:17 +0200 From: ehrlich@inria.inria.fr (Robert Ehrlich) Message-Id: <9009172342.AA12608@inria.inria.fr> To: karl_kleinpaste@cis.ohio-state.edu Subject: Re: help with mbuf leak problem? Cc: dupont@inria.inria.fr Lines: 56 We had a very similar problem on a Pyramid here, and after a long investigation I discovered the reason. In fact the problem is in no way specific of Pyramid, it can happen on any machine acting as NFS server running the code written by SUN as kernel NFS server. There is a bug in this code. It occurs when a RPC call whose server is in the kernel aborts early due to an error. Early means after the socket layer has delivered the packet to the RPC layer, but before calling the approriate dispatcher (in this case the nfs dispatcher). Such an error may be due to 3 causes: 1) authentication failure (e.g. client sending 16 groups to a server supporting a maximum of 8 groups (that was our case)); 2) program not found; 3) version mismatch. There a good chance that second case will never happen, because there is a port number associated whith the NFS server, packets for other programs will be sent to another port. But the cases 1) and 3) may occur, in fact the easiest way to see it is simply to do "rpcinfo -u host nfs", without giving a version number. Rpcinfo makes a first call with a null version number, gets an error reply with the proper version number and issue a second sucessful call. In all the 3 case mentionned above, HERE IS THE BUG, the mbuf containing the packet will never be freed. In the normal case, this is the job of the called program, but in case of these 3 kinds of error, as no program is called, nobody does it. You can see it perfectly by putting the command "rpcinfo -u host nfs" in a loop, and monitoring in the same time the free mbufs using netstat -m. After discovering that, I was wondering why it doesn't happen more often. It seems that the kind of errors causing the problem are very seldom. And you must have a lot of them until you notice the loss of mbufs. Excepted in the case some code loops on such an error. It was the case by us, we noticed the problem when sombody on a SUN started a program doing "df" on each mounted partition every hour. Unfortunatley the SUN was under SunOS 4.0 with 16 groups allowed, and the user doing that was the only one belonging to more than 8 groups. The result was a regular crash of the Pyranid 2 days after boot. I think in your case also something is looping somewhere on the error. Having source code from Pyramid, I had no difficulty in fixing the problem (you need only to add a call to the approriate routine freeing the mbufs in the 3 error cases). But of course I have no right to publish that. The bug was reported to Pyramid, with the fix. In fact I found the bug in all kernels running NFS, excepted in DEC Ultrix. I used mail rather than news for my answer, because I never use news, because when you have never read news the first time you try it takes a so long time that you always cancel it by ^C before. I know your problem because a friend reading news has told me about that. If you think my answer may be useful to others, feel free of sending any part of it in the appropriate newsgroup, in better english if you want, since, as you probably noticed, I am french. ----------------