Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!gem.mps.ohio-state.edu!pacific.mps.ohio-state.edu!tut.cis.ohio-state.edu!ucbvax!ICAEN.UIOWA.EDU!dbfunk
From: dbfunk@ICAEN.UIOWA.EDU (David B Funk)
Newsgroups: comp.sys.apollo
Subject: Re: Doamin on Ethernet problem
Message-ID: <8910210447.AA02085@icaen.uiowa.edu>
Date: 21 Oct 89 03:15:33 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: Iowa Computer Aided Engineering Network, University of Iowa
Lines: 57

WRT posting <1127@cernvax.UUCP>

> Hi there, we are experiencing lot of problems on ours, ethernet based,
> Apollos.
> The problem has been seen both on 3500 and 3000 with ethernet as primary
> (and only in most cases) network.
> The node will loose contact with the network, both at DDS and tcp/ip level,
> rtstat -dev shows enormous numbers for 'no resources', some 20000 per second

I can think of 2 possible causes of this problem:

1)  The "ethernet8_microcode" that was shipped with sr10.1 is seriously flawed.
    The sr9.7 version was not perfect, but it was not nearly as bad as the
    sr10.1 version. The problem is worst in a DDS & IP Ethernet environment;
    if you are running only IP, the sr9.7 version was OK and the sr10.1
    version was marginal. There are various sr10.1 patches out for this,
    but most of them aren't worth messing with. The best solution
    is to get a copy of "/sys/ethernet8_microcode" from an sr10.2 system, even
    from the Beta1 sr10.2 release. This "ethernet8_microcode" works VERY well
    and can be safely installed on sr9.7 & sr10.x systems. We've been using
    it for 2 months now and are quite pleased with it. Talk to your local
    Apollo office; they may be able to get it for you. Just copy the file
    into /sys and reboot. Here's an "rtstat" from one of our ring/E-net
    gateways; note the low E-net error rates:

  $ rtstat -dev -net

  ----------------------------------------------------------------
  80FF1500.12E88   pkts routed:    526964   queue oflo:        0

   Ring            pkts sent:     2559538   pkts rcvd:   2743208
                   NACKs              987   WACKs          27872
                   Xmit bus err         0   Xmit timeouts    303
                   Token inserted      58   Rcv DMA EOR        0
                   Rcv CRC error        0   Rcv timeouts       1
                   Rcv bus error        0   Rcv xmtr error  1033
                   towards net:  80FF1500   ref cnt:     2389382
                   towards net:  80FF1300   ref cnt:      229121

   ETH802.3_AT     pkts sent:      309484   pkts rcvd:    266299
                   Hdwr xmits      1700895  Hdwr rcvs     3304708
                   CRC errors           0   Misalignments      5
                   No resource          0   Over-run           2
                   Adapter err          0   Full socket      694
                   towards net:  80FF4000   ref cnt:      313524
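
To compare a healthy node against a failing one, it can help to reduce the
counters to a couple of numbers. The following is only a minimal sketch in
Python, not an Apollo tool: the function names are hypothetical, the label
pattern is an assumption based on the layout shown above, and the choice of
which counters to treat as receive errors is mine. It parses the ETH802.3_AT
section pasted in as text, and shows how two readings of one counter taken a
known interval apart give a per-second rate such as the 'no resources' figure
quoted above.

```python
import re

# ETH802.3_AT counters copied from the rtstat output above.
SAMPLE = """\
   ETH802.3_AT     pkts sent:      309484   pkts rcvd:    266299
                   Hdwr xmits      1700895  Hdwr rcvs     3304708
                   CRC errors           0   Misalignments      5
                   No resource          0   Over-run           2
                   Adapter err          0   Full socket      694
"""

# Assumed layout: a label of single-spaced words, an optional ':',
# then whitespace and a decimal count that ends at whitespace or end of line.
COUNTER = re.compile(
    r"([A-Za-z][A-Za-z-]*(?: [A-Za-z][A-Za-z-]*)*):?\s+(\d+)(?!\S)"
)

# Assumption: these four counters are the ones treated as receive errors.
RECEIVE_ERROR_LABELS = ("CRC errors", "Misalignments", "No resource", "Over-run")


def parse_counters(text):
    """Return a dict mapping each counter label to its integer value."""
    counters = {}
    for line in text.splitlines():
        for label, value in COUNTER.findall(line):
            counters[label] = int(value)
    return counters


def receive_errors(counters):
    """Sum the counters listed in RECEIVE_ERROR_LABELS (missing ones count as 0)."""
    return sum(counters.get(label, 0) for label in RECEIVE_ERROR_LABELS)


def per_second(before, after, seconds):
    """Rate of change of one counter between two readings taken `seconds` apart."""
    return (after - before) / seconds


if __name__ == "__main__":
    c = parse_counters(SAMPLE)
    ratio = receive_errors(c) / c["Hdwr rcvs"]
    print("receive errors per hardware receive:", ratio)
```

Running "rtstat -dev" twice and feeding the two 'No resource' readings to
per_second() gives a rate that can be compared before and after swapping the
microcode.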


2)  There is a bug in the sr10.0 & sr10.1 implementation of "rgyd".
    This can cause various strange problems that often look like
    "rgyd" dying. When this happens, system operations that deal with
    user IDs or protections (like "getpwuid") may cause network retries.
    You say that "ls -l" will trigger the problem; the "-l" option causes
    "rgyd" operations because of the need to look up the owner name.
    Does the problem still happen if you do a dn3k to dn10k network
    operation that doesn't involve "rgyd" operations? Try a utility
    like "/com/lst" (sr10; under sr9.7 it's "/systest/lst").
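
To make the distinction concrete, here is a minimal sketch in Python for a
generic POSIX system (not Domain/OS-specific code; the function names are
hypothetical). Reading a file's numeric owner needs only the file's own
metadata, while turning that number into a name requires a "getpwuid"
lookup. On sr10 that lookup is the step I would expect to involve "rgyd".

```python
import os
import pwd


def numeric_owner(path):
    """Return the owner's numeric uid; this needs only the file's metadata."""
    return os.stat(path).st_uid


def owner_name(path):
    """Return the owner's login name, as a long listing like "ls -l" shows it.

    The uid-to-name step goes through getpwuid, i.e. the user database.
    """
    uid = numeric_owner(path)
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No entry for this uid; fall back to the number.
        return str(uid)


if __name__ == "__main__":
    print(numeric_owner("."), owner_name("."))
```

If a test that does only the first kind of operation works across the
network while the second kind hangs, that points at the registry rather
than at the Ethernet microcode.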

Dave Funk