Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!amdcad!ames!ll-xn!mit-eddie!husc6!cmcl2!phri!roy
From: roy@phri.UUCP (Roy Smith)
Newsgroups: comp.unix.wizards,comp.bugs.4bsd
Subject: Re: sendmail 5 times faster!?!?
Message-ID: <2994@phri.UUCP>
Date: Sun, 1-Nov-87 10:42:07 EST
Article-I.D.: phri.2994
Posted: Sun Nov 1 10:42:07 1987
Date-Received: Fri, 6-Nov-87 03:17:08 EST
References: <751@cad.luth.se>
Reply-To: roy@phri.UUCP (Roy Smith)
Organization: Public Health Research Inst. (NY, NY)
Lines: 52
Summary: mail system unstable under heavy load
Xref: mnetor comp.unix.wizards:5298 comp.bugs.4bsd:614

In article <751@cad.luth.se> Sven-Ove Westberg writes:
> I have found a big performance pig in sendmail. This patch works fine on
> a Sun3. But it is hard too test the patch since the changes only affect
> the timeout handling. My code is 4-5 times faster on big mails.

Not being a sendmail guru (I shudder every time I contemplate editing a sendmail.cf file) I can't say if Sven-Ove's patch is good or bad, but I thought I'd mention a (possibly) related sendmail problem that has been bugging us, namely mail storms (akin to the famous IP broadcast storms).

The basic setup is a 4.3 Vax-11/750 doing uucp for a bunch of (mostly diskless) 3.[02] Sun-3's. Mail for root, news, usenet, postmaster, etc. is all forwarded to the same two people (I'm one of them). We each get our mail on a different diskless 3/50, but with both / and /usr/spool/mail directories on the same server.

The problem crops up when a whole bunch of mail comes in at once, the most common cause being some downstream news site running out of disk space and dumping 20 or 30 "rnews: execution failed" messages on us in a single uucp connection. Each one generates two sendmail connections (one for my copy and one for the other person's copy), driving the loads on the receiving workstations (and their joint server) through the roof.
Eventually, the workstations can't keep up and the sendmail, ND, and/or NFS connections start to time out. Somebody's probably doing a lot of YP too.

Now the real fun begins. Each timed-out connection generates another error message mailed to root (or maybe postmaster?), which in turn gets forwarded to both of us on our already overloaded machines. At this point, the system has become unstable, with error messages (times 2) being generated faster than they can be delivered. Usually, the end result is the load on the beleaguered Suns going up and up and up until they crash (often with "panic: sbflush 2"). Ever see a perfmeter on a 3/50 roll over to the 0-32 load scale? It's not a pretty sight.

Once the clients crash, things tend to quiet down. I think when the vax sendmail tries to connect to a machine that is down, it just queues the message without generating a mailed error message; only when it gets unexpected errors from the receiving daemon in the middle of a connection does it freak out and generate more mail. Of course, by the time the server has been floundering for 10 minutes, other people think their diskless workstations have crashed and try rebooting; all those nodes screaming for tftp connections and rarping to find out their names don't help the situation, but that's a different story.

I don't know what the proper solution is, but something has to be added somewhere to keep sendmail from going super-critical like this. I note with mild interest that this is the only time I've ever seen our 750 do something which our Suns couldn't keep up with. Maybe if I ran YP on the vax I could slow down sendmail enough to provide the needed damping? :-)
--
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016
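
For anyone who wants to see why the storm feeds on itself, here is a rough toy model of the loop described above. It is only a sketch, not anything measured: the function name, the per-tick delivery capacity, and the rule that every attempt beyond capacity times out and is replaced by one error message forwarded to both recipients are all made-up simplifications. Only the fan-out of two and the "20 or 30 messages" burst size come from the post.

```python
def simulate_storm(initial_messages, recipients=2, capacity_per_tick=10, ticks=8):
    """Toy model of the mail storm: pending delivery attempts per tick.

    Assumptions (hypothetical, not measured):
      - the receiving machines complete at most `capacity_per_tick`
        deliveries per tick;
      - every attempt beyond that times out;
      - each timed-out attempt is replaced by one error message, which is
        forwarded to every recipient (two people in the post).
    """
    pending = initial_messages * recipients  # one connection per recipient
    history = []
    for _ in range(ticks):
        history.append(pending)
        delivered = min(pending, capacity_per_tick)
        timed_out = pending - delivered
        # Each timeout spawns an error message fanned out to all recipients.
        pending = timed_out * recipients
    return history


if __name__ == "__main__":
    print("small burst:", simulate_storm(8, ticks=4))
    print("large burst:", simulate_storm(30, ticks=4))
```

With these made-up numbers a burst of 8 messages drains away (16, 12, 4, 0 pending attempts), while a burst of 30 grows every tick (60, 100, 180, 340). The break-even point in this model is 20 pending attempts; anything above it runs away, which is the "super-critical" behavior in miniature.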