Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!usc!apple!ames!sgi!vjs@rhyolite.wpd.sgi.com
From: vjs@rhyolite.wpd.sgi.com (Vernon Schryver)
Newsgroups: comp.protocols.tcp-ip
Subject: Re: Domain Name Screaming
Summary: turn it off
Message-ID: <37397@sgi.SGI.COM>
Date: 2 Jul 89 01:28:09 GMT
References: <8906292105.AA07216@fornax.ece.cmu.edu>
Sender: daemon@sgi.SGI.COM
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 53

In article <8906292105.AA07216@fornax.ece.cmu.edu>, mathis@FORNAX.ECE.CMU.EDU (Matt Mathis) writes:
> Does anybody have a description of the bug (and fix) of the interaction
> between Yellow Pages and the Domain Name System? This bug causes some
> systems to send DNS requests at high rates for sustained periods in response
> to remote DNS server or network failures. High rates means typically 20
> per second for workstations. I have clocked some at as much as 100 pps!
> Needless to say this is hard on gateways, and disaster to people behind
> 56k links.

If this is not what you are talking about, please excuse me.

repeat by:
  1) some program decides to do gethostbyname(foo.bar) or
     gethostbyaddr(1.2.3.4), checks with portmap & ypbind, and sends
     an rpc request to the correct ypserv.
  2) ypserv gets the request, fails to find the key in the YP map, and
     since YP-to-DNS is turned on, forks a child which does a DNS lookup.
  3) the link to the DNS root or correct authoritative server is down or
     congested, so the child does not get an answer for a while.
  4) meanwhile, the original program in step #1 is waiting for the answer.
     If step #3 takes long enough, the original program does a normal
     YP-rpc timeout, retries, and everything is repeated from step #1.

This is worse than it looks because the time-out in step #4 is less than
the one used by the child in step #3. Each retry therefore forks another
child before the previous one has given up, so one can get large numbers
of children of ypserv, all asking the local DNS server for the same answer.
Some programs, ypmatch may be one, seem to try forever.
This would generate an unbounded, linearly increasing amount of DNS traffic,
except that one usually runs out of resources for the local nameserver and
ypserv parent.

The Internet link to sgi.sgi.com is only 9.6kb/s. When this happens here,
it is noticed.

One version of ypserv from Sun adjusted the time the parent waits for its
children to reduce the number of children. That helped.

We've taken to doing more. First, our ypserv limits the number of children
it has outstanding. Second, it keeps track of what its children are asking
and does not start new ones while old ones are asking the same DNS question.
Third, it caches both negative answers and time-outs from children, and
responds immediately rather than making a new child to ask DNS.

Caveats:
    Least bad values for the cache aging are not obvious to me.
    We've been running the sum of these fixes for a short time, and
        cannot be certain they are sufficient.
    Not all of these fixes are yet in currently shipping SGI products.

Vernon Schryver
Silicon Graphics
vjs@sgi.com
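
The three ypserv changes described above (a cap on outstanding children, no second child for a question already being asked, and a cache of negative answers and time-outs) can be sketched roughly as follows. This is only an illustration in Python, not the SGI or Sun code: the class name LookupGate, its method names, the returned strings, and the default limits (8 children, 60 seconds of cache aging) are all hypothetical, and no process is actually forked. The caller is assumed to fork the DNS child when told "start" and to report back when it exits.

```python
import time


class LookupGate:
    """Decides whether a YP miss may start a new DNS child (hypothetical sketch)."""

    def __init__(self, max_children=8, negative_ttl=60.0, clock=time.monotonic):
        self.max_children = max_children  # assumed limit, not a value from the post
        self.negative_ttl = negative_ttl  # assumed cache aging, in seconds
        self.clock = clock                # injectable so the aging can be tested
        self.in_flight = set()            # questions some child is asking right now
        self.negative = {}                # question -> time its cached failure expires

    def request(self, question):
        """Return 'fail', 'wait', 'busy', or 'start' for an incoming question."""
        now = self.clock()
        expiry = self.negative.get(question)
        if expiry is not None:
            if now < expiry:
                return "fail"             # cached negative answer or time-out
            del self.negative[question]   # cache entry has aged out
        if question in self.in_flight:
            return "wait"                 # a child is already asking this question
        if len(self.in_flight) >= self.max_children:
            return "busy"                 # too many children outstanding
        self.in_flight.add(question)
        return "start"                    # caller may fork a child for this question

    def child_done(self, question, answered):
        """Record that the child for this question exited."""
        self.in_flight.discard(question)
        if not answered:
            # Remember negative answers and time-outs alike.
            self.negative[question] = self.clock() + self.negative_ttl


if __name__ == "__main__":
    gate = LookupGate(max_children=2)
    for name in ("foo.bar", "foo.bar", "a.b", "c.d"):
        print(name, gate.request(name))
    gate.child_done("foo.bar", answered=False)
    print("foo.bar", gate.request("foo.bar"))
```

Because at most one child exists per question, the size of the in-flight set doubles as the count of outstanding children. The sketch leaves the choice of cache aging open, for the reason given in the caveats.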