Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!tut.cis.ohio-state.edu!ucbvax!UCBCMSA.BITNET!CLIFF
From: CLIFF@UCBCMSA.BITNET (Cliff Frost {415} 642-5360)
Newsgroups: comp.sys.proteon
Subject: Re:  4-into-6 coding, and the "clasic" pronet-80 problem
Message-ID: <9003192208.AA25071@devvax.TN.CORNELL.EDU>
Date: 19 Mar 90 22:08:00 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 125

Hi,
We have had a fair amount of experience with this problem here at UC
Berkeley, and I think we have essentially banished it in this form.
With Proteon's help, you should be able to do the same.

> Does anyone know what 33Hex maps into under the 4-into-6 bit coding
> used by Proteon?  etc...

Hex 33 is useful because it is the ASCII character "3", so you can
easily fill a file with 3's (but don't put too many newlines or
carriage returns in).  Hex 33 maps into 100011100011, which when
followed by another hex 33 becomes a repeating series of 3 zeros
followed by 3 ones.  I believe it is the lack of transitions over 3 bit
times that is hard for controllers that are drifting out of spec.
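
To make the run-length claim concrete, here is a small Python sketch.
The only codeword it uses is the one quoted above for the 4-bit value 3
(100011); the rest of the 4-into-6 table is not given in this post, so
the sketch does not try to encode anything else.  The function names
are made up for illustration.

```python
from itertools import groupby

# 6-bit codeword for the 4-bit value 3, as quoted in this post.
# The rest of the 4-into-6 table is not reproduced here.
CODE_FOR_NIBBLE_3 = "100011"


def encode_hex33(count):
    """Return the line bits for `count` consecutive 0x33 bytes."""
    # Each 0x33 byte is two nibbles of value 3, so two codewords.
    return CODE_FOR_NIBBLE_3 * 2 * count


def run_lengths(bits):
    """Collapse a bit string into (bit, run_length) pairs."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]


if __name__ == "__main__":
    stream = encode_hex33(4)
    print(stream)
    print(run_lengths(stream))
    print("longest run:", max(length for _, length in run_lengths(stream)))
```

After the first bit, the stream settles into runs of exactly three
identical bits for as long as the 3's continue.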

There are several other data patterns that are at least as bad as this:
hex 36, 63, 66, BE, BB, EB, EE, and undoubtedly more.

> Can anyone offer a technical explanation of the situation?  Is it that
> the "rf" (120mbps) stages become miss-tuned, ...etc?

Well, I'm a software kind of guy, and our hardware techs sometimes use
the phrase "programmer with a screwdriver" in a sarcastic way, so
take what I say with some sized grain of salt.  ;-)

Each active device on the p80 ring reads the data that comes in using
its own clock to decode it.  If the data is for a node downstream, the
device regenerates it, again using its own clock.  This means that
all the devices on the ring had better have clocks that are in close
alignment with each other.  The clocks are all supposed to be at 120 MHz
+/- a tiny fraction (10 kHz?).  These clocks tick totally independently
of each other; there is no "master" clock.
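
For a feel of how tight that tolerance is, here is a back-of-envelope
calculation in Python.  It takes the 10 kHz figure at face value even
though it is only a guess, and it assumes two free-running clocks at
opposite ends of the tolerance with no resynchronization on data
transitions, which real hardware presumably does.  Treat it as
arithmetic, not as a description of the controller.

```python
NOMINAL_HZ = 120e6    # nominal p80 clock rate given in the post
TOLERANCE_HZ = 10e3   # the post's own guess, NOT a spec value


def tolerance_ppm(nominal_hz, tolerance_hz):
    """Express the clock tolerance in parts per million."""
    return tolerance_hz / nominal_hz * 1e6


def bits_until_one_bit_slip(nominal_hz, tolerance_hz):
    """Bits sent before two clocks at opposite tolerance extremes
    disagree by one full bit time, assuming no resynchronization."""
    worst_case_offset_hz = 2 * tolerance_hz
    return nominal_hz / worst_case_offset_hz


if __name__ == "__main__":
    print("tolerance: %.1f ppm" % tolerance_ppm(NOMINAL_HZ, TOLERANCE_HZ))
    print("bits until a one-bit slip: %.0f"
          % bits_until_one_bit_slip(NOMINAL_HZ, TOLERANCE_HZ))
```

Under those assumptions the tolerance is roughly 83 ppm, and two
worst-case clocks would disagree by a whole bit after 6000 bits.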

This design appears (to me) to lead to some difficult debugging
situations.  You can have a ring that is working ok but has some clocks
at the ragged edge, introduce a new node and all of a sudden your
ring is shot.  The new node may actually be OK, but you might "fix" the
problems by putting in a different controller.  Or you might "fix"
the problems by plugging the controllers in in a different order.

P3280s seem to have the worst problems.  Maybe it's because they have
two independent clocks, or maybe because they get too hot in their
little boxes or maybe their circuitry is really different (big help,
huh?).

> Has anyone found a way to "help" the situation?
>> What is the quick and effective way to find which p3280 or CTL
>> card among the many on the ring already out of alignment?

The only way I know how to deal with this requires real work, but
it is what you have to do:

1)  First you have to determine what order the nodes are in the ring.
This is crucial because of the way the data is clocked and regenerated
by each node.  In order to pinpoint a problem node you have to know
the exact path that data will take through your ring.

    To do this, you go and look at your wire center.  Data will flow
around it in a counter-clockwise direction.

IMPORTANT:  You have to realize that at the link level each packet is
going to go all the way around the ring.  Node A sends it to node B,
and if all goes well node B sends it back with the ACK bit set.  If
all doesn't go well (either the ACK bit is off or the packet is
trashed), node A will retransmit the packet (up to several times).

You need to keep this in mind.  This is the root mechanism that causes
duplicate packets to show up.  Also, if the path from B to A is bad,
A will spend a certain amount of time retransmitting unnecessarily
and this will slow down throughput from A to B--although not nearly
as much as from B to A.
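
The duplicate-packet mechanism can be illustrated with a toy simulation
in Python.  This is only a sketch: the loss probabilities and the
max_tries limit are hypothetical parameters (the text above only says
"up to several times"), and a trashed packet on the A-to-B leg is
simply treated as not delivered.

```python
import random


def send_packet(p_forward, p_return, max_tries, rng):
    """Simulate one link-level send from node A to node B.

    p_forward: chance the packet survives the A -> B part of the ring.
    p_return:  chance it survives the B -> A part with the ACK bit set.
    max_tries: hypothetical retransmit limit.
    Returns (copies_seen_by_b, acked).
    """
    copies_at_b = 0
    for _ in range(max_tries):
        if rng.random() < p_forward:       # A -> B leg survived
            copies_at_b += 1
            if rng.random() < p_return:    # B -> A leg survived too
                return copies_at_b, True
        # ACK missing or packet trashed: A retransmits.
    return copies_at_b, False


if __name__ == "__main__":
    rng = random.Random(1990)
    # Good A -> B path, flaky B -> A path: B tends to see duplicates.
    for _ in range(5):
        print(send_packet(1.0, 0.5, 4, rng))
```

With a perfect A-to-B path and a dead B-to-A path, B receives every
retransmission as a duplicate and A never sees an ACK.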

2)  Next you have to have a way to test each node.  Let's say you have
p4200 routers which have a p80 interface and some others, say an ethernet.
You need access to one of the ethernets from each router.

What you do is ship data across your ring.  From point A to point B
you ship (e.g.) a file with nothing but 3's in it.  Then you ship a file
of the same size with 1's in it (1's are innocuous).  Then do the same
tests from B to A.

-If the 3's are causing problems, you will see very different
throughput rates.
-If there is only one broken node in the ring you will see that the
throughput for the 3's file is dramatically worse in one direction
than the other.
-If there are several broken nodes in the ring you have a much more
difficult hunt, but you can USUALLY get pretty far if not all the way.
I've seen some strange things with this.  Sometimes I've had to
reorder things in the ring to find a bad component.

3)  If you note any funniness across your p3280 links, get your
p3280s upgraded to the latest revs.  We have not had this problem
with our p3280s since we did this.  (We have had a couple of total
failures, but that is at least pretty easily identifiable.)
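
The test files for step 2 can be produced with a few lines of Python.
This is a minimal sketch: the file names and sizes are arbitrary, and
the slowdown helper is just a convenient way to compare whatever
throughput figures your transfer program reports.

```python
def write_fill_file(path, fill_char, size_bytes):
    """Write size_bytes copies of one ASCII character, with no newlines."""
    if len(fill_char) != 1:
        raise ValueError("fill_char must be a single character")
    block = fill_char.encode("ascii") * 65536
    remaining = size_bytes
    with open(path, "wb") as out:
        while remaining > 0:
            n = min(len(block), remaining)
            out.write(block[:n])
            remaining -= n


def slowdown(rate_ones, rate_threes):
    """How many times slower the 3's file moved than the 1's file."""
    return rate_ones / rate_threes


if __name__ == "__main__":
    write_fill_file("threes.dat", "3", 1000000)
    write_fill_file("ones.dat", "1", 1000000)
```

Ship both files A to B and then B to A, and compare the slowdown in each
direction.  A slowdown near 1 both ways suggests the 3's are not hurting;
a large slowdown in only one direction matches the single-bad-node case
described in step 2.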

=====
I have some tools that can help.  They are available for anonymous
ftp from jade.berkeley.edu (128.32.136.9).

1)  pub/ping.c and pub/ping.8:  This lets you specify the data fill
pattern for the packets sent.  This helps you spot the problem early.
Since each ping packet goes in both directions it is no help in
pinpointing the problem.

2)  pub/netout.c:  This sends data to the TCP discard port of a remote
machine.  You can specify the data fill pattern.  This is easier to
use for pinpointing things than ftp, since you don't need an account
on the remote host.  Unfortunately, not everybody has implemented the
TCP discard port code.
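
For readers without a C compiler handy, the same idea can be sketched in
Python.  This is not netout.c and makes no claim to match its options;
the function name and parameters are made up for illustration.  It
assumes the remote host runs the TCP discard service on port 9.

```python
import socket
import time

DISCARD_PORT = 9  # standard TCP discard service port


def send_fill(host, fill_byte, total_bytes, port=DISCARD_PORT,
              chunk_size=8192):
    """Send total_bytes copies of fill_byte to host:port.

    Returns the observed send rate in bytes per second.
    """
    if not 0 <= fill_byte <= 0xFF:
        raise ValueError("fill_byte must fit in one byte")
    chunk = bytes([fill_byte]) * chunk_size
    remaining = total_bytes
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=30) as sock:
        while remaining > 0:
            piece = chunk[:min(chunk_size, remaining)]
            sock.sendall(piece)
            remaining -= len(piece)
    elapsed = time.monotonic() - start
    return total_bytes / elapsed if elapsed > 0 else float("inf")
```

Calling send_fill(host, 0x33, n) and then send_fill(host, 0x31, n) for
the same n gives the 3's versus 1's comparison in one direction; run it
from the other end of the ring to get the reverse direction.  Use a
large n, since small transfers mostly measure local socket buffering.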

=====
We can identify when we are starting to have problems in a couple
of ways.  One is from SNMP collected output errors on the p80 ring
interfaces.  Another is looking at "T 2" in the router consoles and
seeing lots of 8704 errors on the p80 interfaces.  "Lots" is
defined very fuzzily in my mind--it's based on experience...

I don't mind discussing these problems with folks.  I hope this is
helpful to someone; my hands are tired.  ;-)

        Cliff Frost                   (415) 642-5360
        Central Computing Services    <cliff@berkeley.edu>
        University of California      CLIFF AT UCBCMSA
        Berkeley, CA 94720