Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!amdcad!ames!ll-xn!mit-eddie!husc6!cmcl2!phri!roy
From: roy@phri.UUCP (Roy Smith)
Newsgroups: comp.unix.wizards,comp.bugs.4bsd
Subject: Re: sendmail 5 times faster!?!?
Message-ID: <2994@phri.UUCP>
Date: Sun, 1-Nov-87 10:42:07 EST
Article-I.D.: phri.2994
Posted: Sun Nov  1 10:42:07 1987
Date-Received: Fri, 6-Nov-87 03:17:08 EST
References: <751@cad.luth.se>
Reply-To: roy@phri.UUCP (Roy Smith)
Organization: Public Health Research Inst. (NY, NY)
Lines: 52
Summary: mail system unstable under heavy load
Xref: mnetor comp.unix.wizards:5298 comp.bugs.4bsd:614

In article <751@cad.luth.se> Sven-Ove Westberg <sow@cad.luth.se> writes:
> I have found a big performance pig in sendmail.  This patch works fine on
> a Sun3.  But it is hard to test the patch since the changes only affect
> the timeout handling.  My code is 4-5 times faster on big mails.

	Not being a sendmail guru (I shudder every time I contemplate editing
a sendmail.cf file), I can't say if Sven-Ove's patch is good or bad, but I
thought I'd mention a (possibly) related sendmail problem that has been
bugging us, namely mail storms (akin to the famous IP broadcast storms).

	The basic setup is a 4.3 Vax-11/750 doing uucp for a bunch of (mostly
diskless) 3.[02] Sun-3's.  Mail for root, news, usenet, postmaster, etc. is
all forwarded to the same two people (I'm one of them).  We each get our mail
on a different diskless 3/50, but with both / and /usr/spool/mail directories
on the same server.

	The problem crops up when a whole bunch of mail comes in at once; the
most common cause being some downstream news site running out of disk space
and dumping 20 or 30 "rnews: execution failed" messages on us in a single
uucp connection.  Each one generates two sendmail connections (one for my
copy and one for the other person's copy) driving the loads on the receiving
workstations (and their joint server) through the roof.  Eventually, the
workstations can't keep up and the sendmail, ND, and/or NFS connections start
to time out.  Somebody's probably doing a lot of YP too.

	Now the real fun begins.  Each timed-out connection generates another
error message mailed to root (or maybe postmaster?), which in turn gets
forwarded to both of us on our already over-loaded machines.  At this point,
the system has become unstable, with error messages (times 2) being generated
faster than they can be delivered.  Usually, the end result is the load on
the beleaguered Suns going up and up and up until they crash (often with
"panic: sbflush 2").  Ever see a perfmeter on a 3/50 roll over to the 0-32
load scale?  It's not a pretty sight.  Once the clients crash, things tend to
quiet down; I think when the vax sendmail tries to connect to a machine that
is down, it just queues the message without generating a mailed error
message; only when it gets the unexpected errors from the receiving daemon in
the middle of a connection does it freak out and generate more mail.
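
	The feedback loop described above can be illustrated with a toy
model.  This is only a sketch, not a measurement of the real system: the
per-tick delivery capacity, the tick length, and the rule that every
timed-out copy is replaced by one error message forwarded to two people are
all assumptions made for illustration, and the function name is hypothetical.

```python
def simulate_storm(initial_copies, capacity=20, fanout=2, ticks=6):
    """Toy model of a mail storm (all numbers are illustrative assumptions).

    initial_copies: copies waiting for delivery after the first burst
    capacity:       copies the receiving machines can accept per tick
    fanout:         recipients each error message is forwarded to
    """
    backlog = initial_copies
    history = [backlog]
    for _ in range(ticks):
        delivered = min(backlog, capacity)
        timed_out = backlog - delivered
        # Assumption: each timed-out copy is replaced by one error
        # message, which is forwarded to `fanout` recipients.
        backlog = timed_out * fanout
        history.append(backlog)
    return history


if __name__ == "__main__":
    print(simulate_storm(30))  # [30, 20, 0, 0, 0, 0, 0]
    print(simulate_storm(60))  # [60, 80, 120, 200, 360, 680, 1320]
```

	In this model a burst below capacity * fanout / (fanout - 1) copies
(40 with the numbers above) drains on its own, while anything larger grows
without bound, which matches the "error messages generated faster than they
can be delivered" behaviour, at least qualitatively.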

	Of course, by the time the server has been floundering for 10
minutes, other people think their diskless workstations have crashed and try
rebooting; all those nodes screaming for tftp connections and rarping to find
out their names don't help the situation, but that's a different story.

	I don't know what the proper solution is, but something has to be
added somewhere to keep sendmail from going super-critical like this.  I note
with mild interest that this is the only time I've ever seen our 750 do
something which our Suns couldn't keep up with.  Maybe if I ran YP on the vax
I could slow down sendmail enough to provide the needed damping? :-)
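
	One possible form of damping, sketched under the same toy
assumptions (a fixed delivery capacity per tick, two recipients per error
message), is to cap how many error messages may be generated per tick and
quietly requeue the remaining timed-out copies.  Nothing here is an actual
sendmail feature; `max_errors` and the function name are hypothetical.

```python
def simulate_damped(initial_copies, capacity=20, fanout=2,
                    max_errors=5, ticks=5):
    """Toy model: at most `max_errors` error mails are generated per tick.

    Timed-out copies beyond that limit are requeued silently instead of
    producing more mail.  All parameters are illustrative assumptions.
    """
    backlog = initial_copies
    history = [backlog]
    for _ in range(ticks):
        delivered = min(backlog, capacity)
        timed_out = backlog - delivered
        errors = min(timed_out, max_errors)
        # Requeued copies stay as they are; each copy that does produce
        # an error is replaced by `fanout` forwarded error messages.
        backlog = (timed_out - errors) + errors * fanout
        history.append(backlog)
    return history


if __name__ == "__main__":
    print(simulate_damped(60))  # [60, 45, 30, 15, 0, 0]
```

	With these numbers the backlog shrinks every tick as long as
max_errors * (fanout - 1) stays below capacity, so the burst that ran away
in an undamped model drains instead.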
-- 
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016