Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!gem.mps.ohio-state.edu!pacific.mps.ohio-state.edu!tut.cis.ohio-state.edu!ucbvax!ICAEN.UIOWA.EDU!dbfunk
From: dbfunk@ICAEN.UIOWA.EDU (David B Funk)
Newsgroups: comp.sys.apollo
Subject: Re: Doamin on Ethernet problem
Message-ID: <8910210447.AA02085@icaen.uiowa.edu>
Date: 21 Oct 89 03:15:33 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: Iowa Computer Aided Engineering Network, University of Iowa
Lines: 57

WRT posting <1127@cernvax.UUCP>
> Hi there, we are experiencing lot of problems on ours, ethernet based,
> Apollos.
> The problem has been seen both on 3500 and 3000 with ethernet as primary
> (and only in most cases) network.
> The node will loose contact with the network, both at DDS and tcp/ip level,
> rtstat -dev shows enormous numbers for 'no resources', some 20000 per second

I can think of 2 possible causes of this problem:

1) The "ethernet8_microcode" that was shipped with sr10.1 is seriously
   flawed. The sr9.7 version was not perfect, but it was not nearly as bad
   as the sr10.1 version. The problem is worst in a combined DDS & IP
   Ethernet environment; if you are only running IP, the sr9.7 version was
   OK and the sr10.1 version was marginal. There are various sr10.1 patches
   out for this, but most of them aren't worth messing with. The best
   solution is to get a copy of "/sys/ethernet8_microcode" from a sr10.2
   system, even from the Beta1 sr10.2 release. This "ethernet8_microcode"
   works VERY well and can be safely installed on sr9.7 & sr10.x systems.
   We've been using it for 2 months now and are quite pleased with it.
   Talk to your local Apollo office; they may be able to get it for you.
   Just copy the file into /sys and reboot.
   Here's a "rtstat" off one of our ring/E-net gateways. Note the low
   E-net error rates:

$ rtstat -dev -net
----------------------------------------------------------------
80FF1500.12E88  pkts routed:   526964    queue oflo:        0

Ring            pkts sent:    2559538    pkts rcvd:   2743208
                NACKs             987    WACKs          27872
                Xmit bus err        0    Xmit timeouts    303
                Token inserted     58    Rcv DMA EOR        0
                Rcv CRC error       0    Rcv timeouts       1
                Rcv bus error       0    Rcv xmtr error  1033
                towards net: 80FF1500    ref cnt:     2389382
                towards net: 80FF1300    ref cnt:      229121

ETH802.3_AT     pkts sent:     309484    pkts rcvd:    266299
                Hdwr xmits    1700895    Hdwr rcvs    3304708
                CRC errors          0    Misalignments      5
                No resource         0    Over-run           2
                Adapter err         0    Full socket      694
                towards net: 80FF4000    ref cnt:      313524

2) There is a bug in the sr10.0 & sr10.1 implementation of the "rgyd".
   This can cause various strange problems that often look like the rgyd
   dying. When this happens, system operations that deal with user IDs or
   protections (like "getpwuid") may cause network retries. You say that
   "ls -l" will generate the problem. The "-l" option causes "rgyd"
   operations because of the need to extract the owner name. If you do a
   dn3k to dn10k network operation that doesn't involve "rgyd" operations,
   does the problem still happen? Try a utility like "/com/lst" (sr10;
   under sr9.7 it's "/systest/lst").

Dave Funk
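
To make point 2 concrete, here is a rough sketch of why a long listing
touches the user-ID machinery while a plain listing does not. This is
ordinary Unix Python, not Apollo code: the function names plain_listing
and long_listing are made up for illustration, and the assumption is only
that "ls -l" does something broadly like the second function (one
getpwuid-style lookup per file to turn the owner's user ID into a name).

```python
import os
import pwd  # Unix-only module; wraps getpwuid


def plain_listing(directory):
    """Names only: reads the directory, no owner-name lookups."""
    return sorted(os.listdir(directory))


def long_listing(directory):
    """Names plus owner names, roughly what the "-l" option adds."""
    rows = []
    for name in sorted(os.listdir(directory)):
        uid = os.lstat(os.path.join(directory, name)).st_uid
        try:
            # The user-ID-to-name lookup (getpwuid) happens here.
            owner = pwd.getpwuid(uid).pw_name
        except KeyError:
            # No entry for this uid; fall back to the number.
            owner = str(uid)
        rows.append((owner, name))
    return rows


if __name__ == "__main__":
    for owner, name in long_listing("."):
        print(owner, name)
```

If the plain form behaves across the network and the long form does not,
that points at the name lookups rather than at the Ethernet path.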