Path: utzoo!mnetor!uunet!nbires!hao!gatech!ukma!david
From: david@ms.uky.edu (David Herron -- Resident E-mail Hack)
Newsgroups: comp.dcom.lans
Subject: Re: ether errors -- vaxen&4.3&suns&sequent&etc
Message-ID: <8345@e.ms.uky.edu>
Date: 15 Feb 88 22:03:18 GMT
References: <8277@e.ms.uky.edu>
Reply-To: david@ms.uky.edu (David Herron -- Resident E-mail Hack)
Organization: U of Kentucky, Mathematical Sciences
Lines: 116

In article <8277@e.ms.uky.edu> david@ms.uky.edu (David Herron -- Resident E-mail Hack) writes:
>hi ..
>
>we've got a strange little problem.  we're noticing a fairly high
>error rate on *some* of our machines and not others.

Well, it seems I wasn't quite clear enough before ...

When I said "errors", I meant the number returned by the netstat
program.  Yes, I know that it lumps a whole bunch of errors into one
statistic.  But it's what we got.  (We are working on getting some
other gadgets set up -- but lack of money precludes getting a proper
ethernet analyzer).

Here's a fairly typical minute or so of our ethernet:

| 21 - e:david --> netstat -i 5
      input   (qe0)     output               input  (Total)    output
 packets   errs  packets  errs  colls   packets   errs  packets  errs  colls
 7363687 220240  3949440  2829  42243   7489031 220240  4074784  2829  42243
      36      2       19     0      0        38      2       21     0      0
      33      1        9     0      0        33      1        9     0      0
      33      1        9     0      0        33      1        9     0      0
      55      4        4     0      0        55      4        4     0      0
      16      0        7     0      0        18      0        9     0      0
      22      1        1     0      0        22      1        1     0      0
      23      1        1     0      0        23      1        1     0      0
      37      3        7     0      0        37      3        7     0      0
      25      0        4     0      0        25      0        4     0      0
      54      5        3     0      0        54      5        3     0      0
      28      0        7     0      0        28      0        7     0      0
      22      1        0     0      0        22      1        0     0      0
      23      0        2     0      0        23      0        2     0      0
      36      0       12     0      0        36      0       12     0      0
      27      2        3     0      0        27      2        3     0      0
      51      4        1     0      0        51      4        1     0      0
       3      0        1     0      0         3      0        1     0      0

The machine in question is e.ms.uky.edu, a uVaxII which serves partly
as a file server, partly as our news machine, our primary domain
nameserver, and partly as the work machine for some of the staff.
The active connections are mainly rlogin's -- I have a couple going
at the moment which are quiet -- an nntp (to harvard) and a couple of
nfs connections.
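
For anyone who wants a percentage out of the netstat sample above, here
is a minimal sketch in Python.  It assumes the rate of interest is simply
input errs divided by input packets for qe0, which may not match how the
driver actually counts errored packets.  The names SAMPLES, SINCE_BOOT,
error_percent and summarize are made up for this illustration; the numbers
are copied from the sample.

```python
# Input-side counters for qe0, copied from the netstat -i 5 sample above.
# Each tuple is (input packets, input errs) for one reporting interval.
SAMPLES = [
    (36, 2), (33, 1), (33, 1), (55, 4), (16, 0), (22, 1),
    (23, 1), (37, 3), (25, 0), (54, 5), (28, 0), (22, 1),
    (23, 0), (36, 0), (27, 2), (51, 4), (3, 0),
]

# First row of the sample: cumulative totals (input packets, input errs).
SINCE_BOOT = (7363687, 220240)


def error_percent(packets, errs):
    """Return errs as a percentage of packets (0.0 if no packets)."""
    # Assumption: errs / packets is the ratio of interest.
    return 100.0 * errs / packets if packets else 0.0


def summarize(samples):
    """Sum (packets, errs) pairs; return (packets, errs, percent)."""
    packets = sum(p for p, _ in samples)
    errs = sum(e for _, e in samples)
    return packets, errs, error_percent(packets, errs)


if __name__ == "__main__":
    pkts, errs, pct = summarize(SAMPLES)
    print("sampled intervals: %d packets, %d errs, %.2f%%" % (pkts, errs, pct))
    print("cumulative totals: %.2f%%" % error_percent(*SINCE_BOOT))
```

On these figures the sampled intervals come out a little under 5% input
errors and the cumulative totals at roughly 3%.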

The board being used is a DEQNA; I'm not sure if it's a "new" or
"classic" DEQNA.  We do have one machine with both; we're running it
with the "new" DEQNA right now and it's showing the same sort of
error rates.

The Suns and the Sequent are different both in the error rates and in
the type of error.  The Sun has "error" rates a couple of orders of
magnitude less than this, and the Sequent has "error" rates a couple
of orders of magnitude less than the Sun.  Further, they both have a
strong tendency toward collisions in preference to "errors".

Now, the Sun (I'm sampling from the server machine) has 4 workstations
using both nfs and nd from it and is seeing quite a bit of traffic:
frequent 30 second bursts of 100-300 packets a sec on input and, in
those same time slices, an output packet rate of about 2/3 the input.
In my watching right at this minute, the errors are predominantly
collisions with occasional "errors".  The collisions "seem" to be
periodic -- at times there are regular (every 30 seconds) bursts of
collision activity, with a high rate of input packets in those same
bursts.  I'm inclined to point a finger at rwho over that one.  On
the other hand, the pattern isn't there all the time.

Whoever told me to get a lan analyzer so that I'm not guessing -- I
see your point.  But we don't got the bucks right now.

The Sequent is showing the same sorts of activity as the Sun server,
except that it doesn't do nfs and so doesn't ever get those bursts of
100-300 packets per second (or are these numbers from netstat over
the whole 5 second period?).

Yes, we do have both DELNI's and DEMPR's.  We used to have a slightly
illegal configuration that had paths of >2 DELNI's, but now all of
our paths have a max of 2 DELNI's.

Picking out a uVax2000 at random, the "error" rate is in the .01%
range with ZERO collisions.

The last machine is the 11/750 with a DEUNA.  To begin with, it's a
much quieter machine.
It serves out very little NFS and its users don't go out to other
machines very often.  But at any rate, it does show the same sorts of
error rates as the Suns and Sequent.  That is, very few "errors" and
more collisions.  We are running MtXinu's 4.3 on it -- very nearly
the same system as is on the uVaxIIen.

At the moment I'm inclined to believe that we have a couple of
problems.

1. Even the "new" DEQNA's can't keep up very well.  Someone mentioned
   to us that Suns, when responding to NFS requests, generate a block
   of data of whatever size the physical block size of the filesystem
   is -- which can be as much as 8K.  This of course has to be broken
   up by the ethernet driver.  The ether boards in Suns are
   apparently good enough that they can generate packets as fast as
   the ethernet spec allows.  This, coupled with the design shortcuts
   made with the DEQNA, results in the DEQNA being overrun.

   Mark Hittinger mentioned that in VMS there is an option to turn
   off hardware checksums in local area VAXclusters because of this
   sort of problem.  There is some sort of bug in the checksumming
   hardware which shows up on heavily loaded ethernets.  He suggested
   a newer board called a DELQUA or some such.

2. Our rwho's need to be staggered in some way.  Any ideas?

3. It doesn't look like there are any bad boards -- but I can't
   really tell until I can put up an ethernet monitor of some sort.
   I'll be doing some more sleuthing as soon as I can get a pc/ip
   running in an AT -- but we gotta put an ether board in the pc
   first.

4. Someone else mentioned a UB NIU with "xcvr heartbeat" enabled as
   being a problem.  We do have a UB NIU Buffered Repeater on our
   net, so I'll have to check that out.

--
<---- David Herron -- The E-Mail guy
<---- or: {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- It takes more than a good memory to have good memories.