Path: utzoo!attcan!uunet!tut.cis.ohio-state.edu!VAX1.CC.UAKRON.EDU!mcs.kent.edu!usenet.ins.cwru.edu!eagle!data.nas.nasa.gov!noc.arc.nasa.gov!gutierre
From: gutierre@noc.arc.nasa.gov (Robert Michael Gutierrez)
Newsgroups: comp.sys.proteon
Subject: Re: Overview Problem
Message-ID: <1990Oct20.012205.28331@nas.nasa.gov>
Date: 20 Oct 90 01:22:05 GMT
References: <9010170358.AA02833@umd5.UMD.EDU> <9010171142.AA08697@sayshell.umd.edu> <1990Oct19.200834.5767@noc.sura.net>
Sender: news@nas.nasa.gov
Reply-To: gutierre@noc.arc.nasa.gov (Robert Michael Gutierrez)
Organization: NASA Science Internet - Network Operations Center
Lines: 53

oleary@noc.sura.net (dave o'leary) writes:

> Our Overview problems seem to be resolved for the moment.
>
> There were a few different things that we did -
>
> 1/ we watched the ethernet with a sniffer and saw that there were
> no broken ethernet packets .... however the
> SNMP part of the packets (i.e. the UDP data) was somehow
> munged.

Bingo. We found the same problem last night when our Overview
showed the usual signs of crashing (this time, only 1 node went red
instead of all of them). Watching the packets, our engineer noticed
that the SNMP data was corrupted, but after a reboot, all was fine.

> Worse, some of the gateways were crashing with a NM_6B8 bughalt.

We configured all our gateways as read-only (I thought this was a
little paranoid, but now it seems to have been A Good Thing). Were
your gateways configured as full read-write???

We never thought to check whether any of our gateways crashed at the
same time Overview crashed, because we were too busy waiting for the
PC to boot back up and trying to delete all the alerts that had
accumulated.

> Needless to say, this is less than ideal behavior. We also got
> an error on the monitor process of the gateway reporting a
> bad SNMP packet.
> As a hack to get around this we started pinging
> a bunch of the gateways instead of SNMP querying them - at least
> it kept Overview and the gateways from crashing.

I currently have console output from one of our routers being
monitored & logged because it was crashing numerous times. Now that
a connection between bad SNMP packets and crashing routers looks
possible, I'll sort through the output for the relevant messages.

> 2/ We got new software from Proteon - I'm not sure of the details, but
> it was four executables that replaced older versions. This
> seemed to help significantly.

Was this Overview software, or PC-TCP software? Again, when our
Overview crashes, we have no free buffers left, so no programs can
communicate with the PC-TCP driver.

> 3/ We backed up the hard disk, reformatted it, and reinstalled everything.
> This was completed at about 10 last night and things seem to
> have worked since last night.

That was our first step a loooong time ago. Obviously, it never worked.

Robert Michael Gutierrez
NASA Science Internet Office - Network Operations Center.
Ames Research Center, Moffett Field, California. USA.
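
For anyone wanting to do the same sort through logged router console output, here is a minimal sketch in Python. It assumes the log is plain text with one message per line, and it only looks for the two phrases that came up in this thread ("bughalt" and "bad SNMP packet"). The function name and the pattern are my own illustration, not anything from Proteon, and real console output may word these messages differently.

```python
import re

# Phrases taken from this thread; actual console wording is an assumption.
PATTERN = re.compile(r"bughalt|bad SNMP packet", re.IGNORECASE)


def suspicious_lines(lines):
    """Return (line_number, text) pairs for lines matching PATTERN.

    `lines` is any iterable of strings, e.g. an open log file.
    Line numbers start at 1 so they match what an editor shows.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Passing an open file object (for example `suspicious_lines(open("console.log"))`, where the filename is hypothetical) gives the matching lines with their positions, which makes it easier to line up crashes against the times Overview went down.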