Path: utzoo!utgpu!watmath!iuvax!mailrus!cornell!ken
From: ken@gvax.cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: Re: Recommended values for restart timeout parameters?
Message-ID: <34385@cornell.UUCP>
Date: 17 Nov 89 14:31:31 GMT
References: <372@catmkt.COM>
Sender: nobody@cornell.UUCP
Reply-To: ken@gvax.cs.cornell.edu (Ken Birman)
Distribution: na
Organization: Cornell Univ. CS Dept, Ithaca NY
Lines: 67

In article <372@catmkt.COM> tim@capmkt.COM (tim edwards) writes:
>how tight can you reasonably set the -A param to isis and the -f param for
>protos and still have everything work as expected?  .... (etc) ...

I actually don't know.  I guess we need a simulator with which we could
experiment a little.

However, -f15 is probably much too small for what your group is doing.
To fill others in, Capital Market Technologies is using ISIS in a financial
trading setting (or they will if our faster broadcast is fast enough; I
guess the one in ISIS V1.3 is a bit sluggish for this setting).  They run
on a mix of SUN 2, SUN 3 and Apple Mac-2 systems under AUX and last I
heard they were planning a port of ISIS to the latter.

On such a mix you see very long timeouts from the SUN 2 systems, which
are old technology and short of memory.  -f15 is very fast for such a setting,
and in fact a previous person at CMT urged me strongly to support -f120!
I guess -f15 might work for closely matched machines with lots of memory,
though.
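
To make the trade-off above a bit more concrete, here is a rough sketch
in Python (not ISIS code).  It assumes the -f value behaves like a
failure-detection timeout in seconds, which this post does not actually
state.  The delay figures are invented for illustration and are not
measurements from CMT, and the function names are hypothetical.

```python
def false_suspicions(delays, timeout):
    """Count responses slower than `timeout`; each one would look like a failure."""
    return sum(1 for d in delays if d > timeout)


def smallest_safe_timeout(delays, candidates):
    """Return the smallest candidate timeout that no observed delay exceeds.

    Returns None if every candidate would produce at least one false suspicion.
    """
    for t in sorted(candidates):
        if false_suspicions(delays, t) == 0:
            return t
    return None


# Hypothetical response delays in seconds (assumed values, not real data).
fast_machines = [0.2, 0.5, 1.0, 3.0]      # closely matched, lots of memory
slow_machines = [2.0, 9.0, 22.0, 40.0]    # old, memory-starved machines
mixed = fast_machines + slow_machines

if __name__ == "__main__":
    for t in (15, 30, 60, 120):
        print(t, false_suspicions(mixed, t))
    print(smallest_safe_timeout(mixed, [15, 30, 60, 120]))
```

With these made-up numbers a 15 second timeout is fine for the fast
machines alone but would wrongly suspect the slow ones twice, which is
the kind of effect that pushes a mixed installation toward a larger value.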

As for the -A value, I think -A1 or -A2 should be fine.  In ISIS, if a
site recovers unexpectedly fast, we just run the failure and recovery
protocols both at once.

Now, this leads to the strange part.  Tim mentions that he gets a lot
of "partitioned" executions (e.g. site 3 gets killed off and then restarts
quickly and comes up all by itself, not talking to sites 1, 2 and 4 that
stayed operational the whole time).

This is unusual and suggests that the root problem at CMT is actually
that the network itself is flakey.  We are working on this and hope that
by mid 1990 we will be able to release a version of ISIS that runs right
through partitions and heals itself automatically when the network
recovers.  The current versions of ISIS (V1.3, and also V2.0 when we
release it) won't do that, and hence don't "tolerate" partition failures.

Neither does anything else you can buy or run...

My suggestion is that you start by uncovering the cause of these frequent
communication problems: a gateway that crashes, someone kicking the wire
on his/her SUN, or whatever.  Maybe there is a hardware problem here that
can be corrected?  It certainly sounds like an abnormal situation.

If not, perhaps you can run two versions of ISIS, with different
sites files, one on each "side" of the flakey line.  You would need
to partition your application itself, but the new long_haul facility
(see man spool(3)) includes a number of facilities for this, and is
being extended even as I type (Messac is adding a very fast uucp style
file copy and implementing long-haul cbcast and abcast protocols and
building several demo applications with them).  For example, one of our
users is running ISIS on LAN's in Norway, Sweden, DC, San Diego, and
elsewhere, and is using this approach to interconnect the LAN's.  But,
the application is physically partitioned as well; no process groups
try to transparently span the long-haul lines or anything.
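
The general idea of spooling across an unreliable line can be sketched
as follows.  This is only an illustration in Python of store-and-forward
between two independent sides; it is not the ISIS long_haul or spool
interface, and the class and method names (SpoolLink, send, set_link)
are hypothetical.

```python
from collections import deque


class SpoolLink:
    """Store-and-forward link between two separately run sides (illustrative only)."""

    def __init__(self, deliver):
        self.deliver = deliver   # callable that hands a message to the far side
        self.pending = deque()   # messages queued while the line is down
        self.up = True           # current state of the line

    def send(self, msg):
        """Queue the message, then forward whatever the line currently allows."""
        self.pending.append(msg)
        self.flush()

    def flush(self):
        """Deliver queued messages in FIFO order for as long as the line is up."""
        while self.up and self.pending:
            self.deliver(self.pending.popleft())

    def set_link(self, up):
        """Record a change in line state; drain the spool if the line came back."""
        self.up = up
        self.flush()


if __name__ == "__main__":
    received = []
    link = SpoolLink(received.append)
    link.send("quote 1")
    link.set_link(False)         # the flakey line goes down
    link.send("quote 2")         # held in the spool, not lost
    link.set_link(True)          # the line recovers and the spool drains
    print(received)
```

Each side keeps running on its own while the line is down, and only the
spooled traffic crosses it, which is why the application itself has to
be partitioned to match.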

Far in the future, ISIS might make all this transparent, but for
short- and mid-term planning you need to design with partitions in mind.

Ken

PS: We have our V2.0 broadcast protocols running solidly now at Cornell,
and the RPC timing looks quite good.  We are still tuning, but should be
able to post something on this shortly.  The reason I mention this is
to emphasize that our stress right now is on speed, with partitioning
to be addressed in 1990 after V2.0 is out.  This probably is the right
order of priorities for CMT, since V1.3 is really too slow to use in a
broadcast intensive setting like a trading room.