Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!uunet!zephyr.ens.tek.com!uw-beaver!cornell!ken
From: ken@gvax.cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: transport.3
Message-ID: <35855@cornell.UUCP>
Date: 10 Jan 90 18:52:21 GMT
Sender: nobody@cornell.UUCP
Reply-To: ken@cs.cornell.edu (Ken Birman)
Distribution: comp
Organization: Cornell Univ. CS Dept, Ithaca NY
Lines: 333


The following is a proposed copy of the manual page for defining
new transport protocols in conjunction with the ISIS BYPASS
facility.  Questions or comments would be appreciated.  Pat and
I are debugging the interface using an ethernet multicast transport
protocol now (it will make it into ISIS V2.0).  -- Ken

.\" This file should be placed in isis/man/man3/transport.3
.\" Run off using "troff -man transport.3" (or nroff)
.TH ISIS_TRANSPORT 3  "1 February 1986" ISIS "ISIS LIBRARY FUNCTIONS"
.SH NAME
isis_transport -- define new multicast transport protocol.

.SH DESCRIPTION

When compiled with the -DBYPASS flag, ISIS makes use of an
experimental mechanism that bypasses the normal ``protocols''
process whenever possible on transmission of CBCAST and FBCAST
messages.
By default, this is done using a UDP-based message delivery protocol,
but you can substitute a delivery protocol of your own if you
like.
A delivery protocol has two major features: it may accept or reject
individual messages, and any message that it has accepted must be
delivered reliably to the groups or processes addressed.
It is somewhat weaker than a transport protocol, which must also order
messages; since ISIS is ordering messages at a higher level,
the delivery protocol need not repeat this work.
If a message is rejected by a user-supplied delivery protocol, it will
be transmitted using the default protocol, which accepts all
messages.

ISIS V2.0 provides facilities
for determining whether or not a given
destination is operational, for message fragmentation
and reassembly, and for hinting whether or not
a given message will generate a reply.
The rest of this manual page describes in detail the routines that a
given delivery protocol must provide, and the routines that
ISIS provides to assist the protocol implementor.

First, some restrictions.
The bypass protocols may only be used in situations where the sender
of a multicast belongs to the group to which it is sending.
Moreover, the protocols are very limited in the type of destination
addressing permitted.  
They are used only for multicasts to a single group and for point-to-point
messages from one group member to another, such as replies to a
group multicast.  
Any other sort of destination addressing causes ISIS to route the
multicast via the pre-existing, slower protocol suite.
This includes cases in which a message is sent by a client of a group
(see pg_client) or to a set of destinations that includes clients
of a group.
If your application needs to multicast to a group of processes and you	
want to use the bypass protocols, you will therefore need to create
a group that corresponds closely to the set of destination processes.

The BYPASS mechanism works correctly even if the processes involved
belong to multiple groups, or join and leave the same group
multiple times during an execution.
The initial implementation, on top of UDP, is already quite fast.
Using the transport mechanism, however, you can extend it to take
advantage of special hardware that might require sending fewer than
one message per destination (for example, a single multicast packet
that reaches several members), to use special protocols that might
be faster than UDP by avoiding some of the work that UDP normally
does, or to support protocols with special realtime properties or
other properties that interest you.
Since our UDP protocol is oriented towards sending 8k messages
using a windowing flow control method, you would almost certainly
need to implement a different scheme for doing realtime graphics or voice,
or for sending huge quantities of data.

To define a new transport protocol, you will need to provide ISIS with 3
routines: \fBmt_send\fR, \fBmt_groupview\fR and 
\fBmt_physdead\fR.
Inform ISIS of the names of these routines by calling 
\fBisis_transport(pn, mt_send, mt_groupview, mt_physdead)\fR,
where \fBpn\fR is a transport protocol number between 1 and
\fBMAXTRANSPORT-1\fR.
Transport protocol 0 is pre-defined to correspond to the ISIS UDP
protocol, and should not be changed.
On the other hand, you may find it convenient to call this protocol
for sending point-to-point messages from within your transport
code.  
We explain how to do this below.
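
A minimal sketch of the registration step, under stated assumptions.
The \fBaddress\fR, \fBmessage\fR and \fBginfo\fR types, the value of
\fBMAXTRANSPORT\fR, the return types of the group view and failure routines,
and the body of \fBisis_transport\fR are placeholders standing in for
the real ISIS declarations so that the fragment compiles on its own.
The \fBmy_\fR routines and protocol number 3 are hypothetical, and ANSI
prototypes are used in place of the older declaration style.
.nf
```c
#include <assert.h>

/* Placeholder types; the real ones come from the ISIS headers. */
typedef struct { int site, pid; } address;
typedef struct { int len; } message;
typedef struct { int nmemb; } ginfo;

#define MAXTRANSPORT 8   /* assumed value, for illustration only */
#define MY_PN 3          /* hypothetical choice in 1..MAXTRANSPORT-1 */

typedef int (*callback_fn)(address *dest, char *arg0, char *arg1);
typedef int (*send_fn)(address *gaddr, int exmode, message *mp, address *to,
                       callback_fn callback, char *arg0, char *arg1);
typedef void (*view_fn)(ginfo *gip);
typedef void (*dead_fn)(address *who);

struct transport { send_fn send; view_fn groupview; dead_fn physdead; };
static struct transport transports[MAXTRANSPORT];

/* Stand-in for the ISIS routine: it only records the three entry points.
   The -1 return for a bad protocol number is an assumption. */
static int isis_transport(int pn, send_fn s, view_fn v, dead_fn d)
{
    if (pn < 1 || pn > MAXTRANSPORT - 1)
        return -1;               /* slot 0 belongs to the UDP protocol */
    transports[pn].send = s;
    transports[pn].groupview = v;
    transports[pn].physdead = d;
    return 0;
}

/* A do-nothing protocol: rejecting every message makes ISIS fall back
   to protocol 0, so installing it changes nothing observable. */
static int my_send(address *gaddr, int exmode, message *mp, address *to,
                   callback_fn callback, char *arg0, char *arg1)
{
    (void)gaddr; (void)exmode; (void)mp; (void)to;
    (void)callback; (void)arg0; (void)arg1;
    return -1;
}
static void my_groupview(ginfo *gip) { (void)gip; }
static void my_physdead(address *who) { (void)who; }

int register_my_transport(void)
{
    assert(MY_PN >= 1 && MY_PN < MAXTRANSPORT);
    return isis_transport(MY_PN, my_send, my_groupview, my_physdead);
}
```
.fi
Starting from a protocol that rejects everything is one cautious way to
bring up new transport code: each kind of message can be accepted only
once the code that handles it has been tested.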

When communication with a group becomes a possibility, ISIS will call
\fBmt_groupview(ginfo_p)\fR, giving the address of the
\fBginfo\fR structure about the group.
This routine will also be called every time the group membership
changes.
\fIIt will be called in the same order at all processes that belong
to the group. \fR
Your protocol should expect to begin receiving messages from a
process anytime after this routine is called with a groupview
containing that process.

When a process fails, ISIS goes through a two-stage sequence.
First, the system may call \fBmt_physdead(addr_p)\fR, giving the
address of that process.
It does this as a sort of a hint to your protocol because
there may be a delay before the group membership is changed,
e.g. because some messages are being flushed to ensure atomicity
of multicasts initiated by the process that crashed.
However, there are situations where this routine will not be called at all,
hence it should be treated as a hint and nothing more.
Calls to \fBmt_physdead\fR may come in \fIany\fR order at different
members.

However, ISIS will guarantee that if a member fails,
surviving members will \fIeventually\fR see either a call to
\fBmt_physdead\fR for this member, or a call to 
\fBmt_groupview\fR with a view that does not contain that member.
Moreover, calls to \fBmt_groupview\fR are made in the same order everywhere.
This is a property on which your transport protocol can depend.
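
One way a protocol might combine the two notifications is sketched here.
The layout of \fBginfo\fR (a count and an array of members) is
hypothetical, as are the helper names.
The sketch keeps failure hints apart from the current view, so that a
view that still lists a failed member does not bring it back to life.
.nf
```c
#include <assert.h>
#include <string.h>

#define MAXMEMB 16

typedef struct { int site, pid; } address;                       /* placeholder */
typedef struct { int nmemb; address members[MAXMEMB]; } ginfo;   /* hypothetical layout */

static address view[MAXMEMB];    /* members of the most recent view */
static int nview;
static address hinted[MAXMEMB];  /* members reported by mt_physdead */
static int nhinted;

static int find(const address *set, int n, const address *who)
{
    int i;
    for (i = 0; i < n; i++)
        if (set[i].site == who->site && set[i].pid == who->pid)
            return i;
    return -1;
}

/* A new view replaces the old one outright. */
void my_groupview(ginfo *gip)
{
    int i, kept = 0;
    assert(gip->nmemb >= 0 && gip->nmemb <= MAXMEMB);
    nview = gip->nmemb;
    memcpy(view, gip->members, (size_t)nview * sizeof(address));
    /* Forget hints about processes the new view has already dropped. */
    for (i = 0; i < nhinted; i++)
        if (find(view, nview, &hinted[i]) >= 0)
            hinted[kept++] = hinted[i];
    nhinted = kept;
}

/* Only a hint: remember it until a view confirms the departure. */
void my_physdead(address *who)
{
    if (find(hinted, nhinted, who) < 0 && nhinted < MAXMEMB)
        hinted[nhinted++] = *who;
}

int is_live(const address *who)
{
    return find(view, nview, who) >= 0 && find(hinted, nhinted, who) < 0;
}

/* Returns 1 if member b is live, then dead after the hint, then still
   dead once a view without it arrives and the hint has been pruned. */
int membership_demo(void)
{
    ginfo v1 = { 3, { {1, 10}, {2, 20}, {3, 30} } };
    ginfo v2 = { 2, { {1, 10}, {3, 30} } };
    address b = { 2, 20 };
    int ok;
    my_groupview(&v1);
    ok = is_live(&b);
    my_physdead(&b);
    ok = ok && !is_live(&b);
    my_groupview(&v2);
    ok = ok && !is_live(&b) && nhinted == 0;
    return ok;
}
```
.fi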

Your protocol may sometimes detect apparent failures.  
ISIS does not allow you to act on such failures directly, since
you could be wrong.  However, it does provide a way for
your code to encourage ISIS to check the status of a
recalcitrant process.
This is done via the routine \fBisis_querydead(addr_p)\fR.

\fBisis_querydead(addr_p)\fR
returns 1 if ISIS believes that this member is dead, and 0
otherwise.
That is, the routine keeps track of process status using
\fBmt_physdead(addr_p)\fR and \fBmt_groupview\fR.
In addition, when you call this routine and before it returns a value,
ISIS will probe the indicated process to see if it is responsive.
Thus, if 0 is returned, the process actually responded to a probe
message \fIafter\fR you detected its apparent failure.
Probing may take a while; in the worst case ISIS could run its site-level
failure detector, which could require a minute or two to complete.

When a process is declared dead, whether through \fBmt_groupview\fR,
\fBmt_physdead\fR or \fBisis_querydead\fR, it should be treated
like a sink: all messages to that process (if any) should be
discarded and ISIS should be told that any pending messages
(and any future attempts to send) have terminated.

The basic transport communication interface is through the send routine.
Calls to \fBmt_send\fR have the following interface:
.nf
mt_send(gaddr, exmode, mp, to, callback, arg0, arg1)
address *gaddr;
register message *mp;
register address *to;
int (*callback)();
char *arg0, *arg1;
{
....
}
.fi

Basically, a call to \fBmt_send\fR requests that \fBmp\fR be
transmitted to the destinations in \fBto\fR and that the 
\fBcallback\fR routine be invoked when the message is known to have
been delivered or the destination is dead.
The destinations in \fBto\fR are guaranteed to be a proper subset of the
members and clients of the group.
(Your protocol may or may not take advantage of this - ISIS will throw
away surplus messages).
If the callback pointer is non-null, you
should do the callback separately for each destination, as
follows: \fB(*callback)(addr_p, arg0, arg1);\fR, where
\fBaddr_p\fR is a pointer to the address of the destination in
question.
It is important that you not do the callback until the messages
have reached their remote destinations safely, as this is one of the
tools ISIS uses to decide how long to keep spare copies of a message
to ensure atomicity after failures.
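
The bookkeeping this implies can be sketched as follows, with several
assumptions.
The types are placeholders, an all-zero \fBaddress\fR is assumed to end
the destination list, \fBwire_send\fR is a hypothetical primitive that
puts a packet on the network, and \fBmy_complete\fR is a hypothetical
routine the protocol would call when a destination acknowledges or is
declared dead.
For brevity one acknowledgement completes everything outstanding for
that destination; a real protocol would match sequence numbers.
.nf
```c
#include <assert.h>
#include <stddef.h>

typedef struct { int site, pid; } address;   /* placeholder types */
typedef struct { int len; } message;
typedef int (*callback_fn)(address *dest, char *arg0, char *arg1);

#define MAXPENDING 32

struct pending { int used; address dest; callback_fn cb; char *arg0, *arg1; };
static struct pending pend[MAXPENDING];

static int is_null(const address *a) { return a->site == 0 && a->pid == 0; }
static int same(const address *a, const address *b)
{
    return a->site == b->site && a->pid == b->pid;
}

/* Hypothetical wire-level primitive; a real one would transmit mp. */
static void wire_send(message *mp, address *dest) { (void)mp; (void)dest; }

int my_send(address *gaddr, int exmode, message *mp, address *to,
            callback_fn cb, char *arg0, char *arg1)
{
    address *d;
    int i, need = 0, avail = 0;

    (void)gaddr; (void)exmode;          /* not needed for this sketch */
    assert(mp != NULL && to != NULL);
    for (d = to; !is_null(d); d++)
        need++;
    for (i = 0; i < MAXPENDING; i++)
        if (!pend[i].used)
            avail++;
    if (need > avail)
        return -1;                      /* reject: no room to track it */
    for (d = to; !is_null(d); d++) {
        for (i = 0; pend[i].used; i++)  /* a free slot exists: avail >= need */
            ;
        pend[i].used = 1;
        pend[i].dest = *d;
        pend[i].cb = cb;
        pend[i].arg0 = arg0;
        pend[i].arg1 = arg1;
        wire_send(mp, d);
    }
    return 0;                           /* accepted; callbacks come later */
}

/* Call when `who` has acknowledged, or has been declared dead. */
void my_complete(address *who)
{
    int i;
    for (i = 0; i < MAXPENDING; i++) {
        struct pending p = pend[i];
        if (!p.used || !same(&p.dest, who))
            continue;
        pend[i].used = 0;
        if (p.cb != NULL)
            p.cb(&p.dest, p.arg0, p.arg1);
    }
}

static int ncalls;
static int count_cb(address *dest, char *arg0, char *arg1)
{
    (void)dest; (void)arg0; (void)arg1;
    ncalls++;
    return 0;
}

/* Two destinations, one acknowledged and one declared dead. */
int send_demo(void)
{
    address to[] = { {1, 10}, {2, 20}, {0, 0} };
    message m = { 100 };
    ncalls = 0;
    if (my_send(NULL, 0, &m, to, count_cb, NULL, NULL) != 0)
        return -1;
    my_complete(&to[0]);
    my_complete(&to[1]);
    return ncalls;
}
```
.fi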

Your protocol may reject a request to send a message, by returning -1; it
should return 0 if the message was accepted.  
In the former case, ISIS will transmit the message in question using transport
protocol 0.
The protocol 0 transport routine is called \fBnet_send\fR; 
it transmits messages using the UDP packet protocol.
You are free to call \fBnet_send\fR if your protocol has a need
for reliable point-to-point messaging.
However, be aware that \fBnet_send\fR uses acknowledgement packets to
ensure that delivery is reliable.
There is no unacknowledged version of the \fBnet_send\fR protocol.
This means that \fBnet_send\fR is not a particularly good way
to send acknowledgement packets needed by your own protocol, unless
you want to be absolutely sure they reach their destination.
(Using an acknowledged protocol to send the acks for protocol \fBn\fR
could effectively double its acknowledgement traffic).

In general, \fBto\fR may include the address of the \fIsender\fR.
That is, for some messages, there will be an address in this
null-terminated list for which \fBaddr_ismine()\fR returns TRUE.
The \fBexmode\fR flag is set to 1 if this copy of the message
\fIshould be ignored\fR.
In this case, you should transmit the packet to all addresses except this
one.
On the other hand, if \fBexmode\fR is 0, this ``local'' copy of the message
can be delivered whenever your protocol is ready to do so
(immediately if you like) by calling \fBisis_receipt(mp, &my_address, pn)\fR.
Here, \fBpn\fR is the protocol transport number you picked for your
protocol.

Notice that if \fBexmode\fR is set to 1 and \fBto\fR only lists one process
for which \fBaddr_ismine\fR returns true, your protocol will not
need to do any work, but must call the
callback routine if the pointer is non-null.
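
The treatment of the sender's own address might look roughly like this.
It is a sketch under assumptions: the types are placeholders,
\fBaddr_ismine\fR and \fBisis_receipt\fR are replaced by stand-ins that
only count calls, \fBwire_send\fR is hypothetical, and an all-zero
address is assumed to end the list.
Doing the callback for the local address immediately in both modes is
this sketch's reading of the rule that each destination gets a callback.
.nf
```c
#include <assert.h>
#include <stddef.h>

typedef struct { int site, pid; } address;   /* placeholder types */
typedef struct { int len; } message;
typedef int (*callback_fn)(address *dest, char *arg0, char *arg1);

#define MY_PN 3                              /* hypothetical protocol number */
static address my_address = { 1, 10 };
static int local_deliveries, remote_sends, callbacks;

static int is_null(const address *a) { return a->site == 0 && a->pid == 0; }

/* Stand-ins for the ISIS routines; they only count calls. */
static int addr_ismine(address *a)
{
    return a->site == my_address.site && a->pid == my_address.pid;
}
static void isis_receipt(message *mp, address *from, int pn)
{
    (void)mp; (void)from; (void)pn;
    local_deliveries++;
}
/* Hypothetical wire-level primitive. */
static void wire_send(message *mp, address *dest)
{
    (void)mp; (void)dest;
    remote_sends++;
}

int my_send(address *gaddr, int exmode, message *mp, address *to,
            callback_fn cb, char *arg0, char *arg1)
{
    address *d;

    (void)gaddr;
    assert(mp != NULL && to != NULL);
    for (d = to; !is_null(d); d++) {
        if (addr_ismine(d)) {
            if (exmode == 0)
                isis_receipt(mp, &my_address, MY_PN);  /* local copy */
            /* exmode == 1: nothing to deliver, but the callback is owed */
            if (cb != NULL)
                cb(d, arg0, arg1);
        } else {
            wire_send(mp, d);   /* callback deferred until acknowledged */
        }
    }
    return 0;
}

static int note_cb(address *dest, char *arg0, char *arg1)
{
    (void)dest; (void)arg0; (void)arg1;
    callbacks++;
    return 0;
}

/* Encodes the three counters as a decimal number: local, remote, callbacks. */
int exmode_demo(int exmode)
{
    address to[] = { {1, 10}, {2, 20}, {0, 0} };
    message m = { 64 };
    local_deliveries = remote_sends = callbacks = 0;
    my_send(NULL, exmode, &m, to, note_cb, NULL, NULL);
    return local_deliveries * 100 + remote_sends * 10 + callbacks;
}
```
.fi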

In the case of a receipt of a message from some remote
destination, call \fBisis_receipt(mp, addr_p, pn)\fR,
specifying the address from which the message arrived and the
protocol transport number that received it.
\fBisis_receipt\fR will put messages in sequence and detect and reject
duplicates, so your protocol need not worry about doing this.
The application will get stuck, however, if your protocol accepts a message
but never gets around to delivering it at some destination to which
the message is addressed.
The client dump contains enough information to figure out that this has
happened, but you need to suspect the problem to know where to look.

ISIS provides several routines that we find helpful in designing transport
protocols.
For example, your protocol may need to send a message to acknowledge receipt
of message \fBmp\fR.  
Should it send the acknowledgement immediately, or wait a little while
in the hope that some other message will be sent back to the
sender of \fBmp\fR?
Obviously, ISIS can't predict the future, but it can tell if the sender
of \fBmp\fR is waiting for one or more replies.  
If so, your protocol might want to wait until the routine to which \fBmp\fR
is delivered has run, in the hope that it will generate such a
reply (if a multicast is received using protocol \fBpn\fR, any
reply to it will be presented first to the \fBmt_send\fR routine
for protocol \fBpn\fR).
The predicate \fBMAY_REPLY(mp)\fR tells whether a reply to \fBmp\fR is possible.
If this predicate is false, the message will not generate a reply.
If this predicate is true, the message might generate a reply fairly
soon. 
This is a hint to the delivery protocol not to send
an acknowledgement immediately, as there is a chance that the 
acknowledgement can be piggybacked on the reply message.

But, how long should your protocol wait?  
After all, \fBMAY_REPLY(mp)\fR only gives a hint, and perhaps no reply
will be sent!
To overcome this problem, ISIS provides a routine \fBt_suspend\fR.
If your protocol's receiving task does a \fBt_suspend()\fR,  ISIS will
deliver the message \fBmp\fR.
If a reply gets sent immediately, your \fBmt_send\fR routine will be
called while the receiving task is still suspended.
If no reply is sent, or \fBisis_receipt\fR can't deliver \fBmp\fR
promptly, \fBt_suspend\fR will return and you can send the
acknowledgement as a separate packet.
Note that \fBt_suspend\fR returns no indication of what happened; you
are expected to keep track of this yourself using global flags.
The \fBMAY_REPLY\fR predicate is uniformly true or false at all processors
that receive a given message.
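
The flag-keeping described above might be arranged as in this sketch.
Everything ISIS-specific is a stand-in: \fBMAY_REPLY\fR reads a
hypothetical field, \fBisis_receipt\fR just queues the message, and
\fBt_suspend\fR simulates the application replying (or not) according
to a knob used only for demonstration.
A single global flag is used; a real protocol would track the owed
acknowledgement per sender.
.nf
```c
#include <assert.h>
#include <stddef.h>

typedef struct { int site, pid; } address;     /* placeholder type */
typedef struct { int wants_reply; } message;   /* hypothetical field */

#define MAY_REPLY(mp) ((mp)->wants_reply)      /* stand-in for the predicate */
#define MY_PN 3                                /* hypothetical protocol number */

static int ack_owed;            /* global flag: an ack is still to be sent */
static int separate_acks, piggybacked_acks;
static int app_replies_now;     /* demonstration knob, not part of ISIS */
static message *queued;

static void send_ack(void) { separate_acks++; }

/* The reply path of mt_send, reduced to the ack bookkeeping. */
int my_send_reply(void)
{
    if (ack_owed) {
        piggybacked_acks++;     /* the ack rides on the reply */
        ack_owed = 0;
    }
    return 0;
}

/* Stand-ins: isis_receipt queues, t_suspend lets the handler run. */
static void isis_receipt(message *mp, address *from, int pn)
{
    (void)from; (void)pn;
    queued = mp;
}
static void t_suspend(void)
{
    if (queued != NULL && app_replies_now)
        my_send_reply();
    queued = NULL;
}

/* Receiving task: called for each data packet that arrives. */
void on_packet(message *mp, address *from)
{
    assert(mp != NULL);
    ack_owed = 1;
    isis_receipt(mp, from, MY_PN);
    if (MAY_REPLY(mp))
        t_suspend();            /* give a reply the chance to carry the ack */
    if (ack_owed) {             /* nothing carried it: send it separately */
        send_ack();
        ack_owed = 0;
    }
}

/* Returns separate acks in the tens digit, piggybacked acks in the units. */
int piggyback_demo(int may_reply, int replies_now)
{
    address from = { 2, 20 };
    message m;
    m.wants_reply = may_reply;
    separate_acks = piggybacked_acks = 0;
    app_replies_now = replies_now;
    queued = NULL;
    on_packet(&m, &from);
    queued = NULL;
    return separate_acks * 10 + piggybacked_acks;
}
```
.fi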

One easy mistake is to forget that \fBmsg_read\fR creates a message.
Don't forget to do a \fBmsg_delete\fR after your code has finished
with such a message, or with one extracted from inside another
message using the \fB%m\fR format item.

Your protocol may have a size limit on messages.  
To check the size of a message, call \fBmsg_getlen(mp)\fR.
If a message is too long for your protocol, you may call
\fBisis_fragment(size, mt_send, gaddr, exmode, mp, to, callback, arg0, arg1)\fR.
This routine will repeatedly call the specified \fBmt_send\fR routine
with fragments of the message pointed to by \fBmp\fR that are
no larger than \fBsize\fR.
The remaining arguments to \fBisis_fragment\fR, including \fBgaddr\fR,
\fBexmode\fR and \fBto\fR, will be passed to
the send routine unchanged.
However, the callback routine will be passed as a null pointer on all but
the last fragment of the message.
This way, the user-supplied callback will not be called until after all
fragments of the message have been delivered.
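
The callback rule for fragments can be illustrated with a stand-in.
The \fBisis_fragment\fR and \fBmsg_getlen\fR bodies here are
placeholders written for this sketch, not the ISIS code; the message
type carries only a length, the 8192 byte limit is hypothetical, and
the return value of \fBisis_fragment\fR is an assumption.
.nf
```c
#include <assert.h>
#include <stddef.h>

typedef struct { int site, pid; } address;   /* placeholder types */
typedef struct { int len; } message;
typedef int (*callback_fn)(address *dest, char *arg0, char *arg1);
typedef int (*send_fn)(address *gaddr, int exmode, message *mp, address *to,
                       callback_fn cb, char *arg0, char *arg1);

#define MAX_PACKET 8192          /* hypothetical protocol limit */
static int packets, callbacks_passed;

static int msg_getlen(message *mp) { return mp->len; }   /* stand-in */

/* Stand-in: split by length only, passing the callback on the last piece. */
static int isis_fragment(int size, send_fn send, address *gaddr, int exmode,
                         message *mp, address *to, callback_fn cb,
                         char *arg0, char *arg1)
{
    int off = 0;
    assert(size > 0);
    do {
        message frag;
        int left = mp->len - off;
        int last = (left <= size);
        frag.len = last ? left : size;
        if (send(gaddr, exmode, &frag, to, last ? cb : NULL, arg0, arg1) != 0)
            return -1;
        off += frag.len;
    } while (off < mp->len);
    return 0;
}

int my_send(address *gaddr, int exmode, message *mp, address *to,
            callback_fn cb, char *arg0, char *arg1)
{
    if (msg_getlen(mp) > MAX_PACKET)    /* too big: come back in pieces */
        return isis_fragment(MAX_PACKET, my_send, gaddr, exmode, mp, to,
                             cb, arg0, arg1);
    packets++;                          /* a real protocol transmits here */
    if (cb != NULL)
        callbacks_passed++;
    return 0;
}

static int nop_cb(address *dest, char *arg0, char *arg1)
{
    (void)dest; (void)arg0; (void)arg1;
    return 0;
}

/* Returns packets sent in the tens digit, callbacks seen in the units. */
int frag_demo(int len)
{
    address to[] = { {2, 20}, {0, 0} };
    message m;
    m.len = len;
    packets = callbacks_passed = 0;
    if (my_send(NULL, 0, &m, to, nop_cb, NULL, NULL) != 0)
        return -1;
    return packets * 10 + callbacks_passed;
}
```
.fi
Passing \fBmy_send\fR itself as the send routine is safe in this sketch
because every fragment is within the size limit, so the recursion stops
after one level.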

ISIS will always use protocol 0 by default.
To convince ISIS to use your protocol, use \fBmbcast_l\fR, \fBfbcast_l\fR,
\fBcbcast_l\fR or \fBabcast_l\fR,
specifying the option \fB``Pn''\fR where \fBn\fR is the protocol
transport number you chose.
The mbcast protocol gives FIFO ordering (like FBCAST) but might not be
atomic in the event that the sender fails during transmission.
The other protocols are atomic and
give FIFO delivery order, CBCAST order, and ABCAST order
respectively, and each is more costly than the preceding one.
Also, the others are ordered with respect to GBCAST invocations, while
\fBmbcast_l\fR might not be.

\fBmbcast_l\fR 
gives the very fastest possible multicast in ISIS, short of calling
your transport protocol directly.

Note that ISIS will not use your protocol if you don't obey
the various restrictions on destination address.  
In such cases, the multicast will work using the old, slow ISIS mechanism.

We expect that support for external delivery protocols will gradually
improve in future releases of ISIS.
One facility we are considering will allow a protocol to learn something
about network topology at runtime, for example to determine if
all of a set of processes are on the same ethernet or token ring.

Recall that -DBYPASS is specified when you compile ISIS.  This
flag only affects two modules of the ISIS client library, clib.
The rest of ISIS is the same whether or not BYPASS is used.
If you don't compile ISIS with -DBYPASS, the system
never uses the new BYPASS code at all.

You can mix code linked to a clib compiled with -DBYPASS and code compiled
without this, but only if all members of a process group use the
same rule.
Other mixtures will hang, e.g. if a process linked to a library built with
-DBYPASS tries to join a group linked without BYPASS.

Because the BYPASS facility is experimental, it may have bugs that will
render ISIS less reliable.  
We believe that with BYPASS disabled, ISIS V2.0 is the most robust
version of ISIS yet released, but with it, problems will be
likely for at least a few months.  
If performance is not an issue for your application but reliability is
crucial, you may wish to compile and link most applications with
a version of clib for which BYPASS is not enabled, using the
BYPASS version only in isolated applications for which performance
is especially important.  
However, this requires multiple copies of the client library routines and
hence might be a bit hard to administer if your group is not very
``sophisticated''.