Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!decvax!decwrl!ucbvax!UTAH-20.ARPA!ZELEZNIK
From: ZELEZNIK@UTAH-20.ARPA.UUCP
Newsgroups: mod.computers.apollo
Subject: Alternate Links
Message-ID: <12221858028.8.ZELEZNIK@UTAH-20.ARPA>
Date: Fri, 11-Jul-86 12:31:18 EDT
Article-I.D.: UTAH-20.12221858028.8.ZELEZNIK
Posted: Fri Jul 11 12:31:18 1986
Date-Received: Sat, 12-Jul-86 01:12:40 EDT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 39
Approved: apollo@yale-comix.arpa

In reference to alternate links, we have developed a two-fold approach to
maintaining a consistent environment (at the system, node, and user levels)
across single node failures.

First, each entry directory with "critical" data has a backup location on
some other node.  For system-level directories, each backup contains only
those branches which are required to maintain the environment (e.g., /etc,
/sys/tcp, ... objects which exist in only one place).  For user login
directories, the backup is user-specified (usually top-level files and
links, user_data, and personal com/bin directories).  In this way, both the
system and user environments are preserved.
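
As a rough illustration of backing up only selected branches, here is a
minimal sketch in Python.  It is not the script we run; the function name
mirror_branches and its arguments are made up for this example, and it
assumes ordinary local paths rather than network pathnames.

```python
import shutil
from pathlib import Path


def mirror_branches(source, backup, branches):
    """Copy only the listed branches of `source` into `backup`.

    Hypothetical helper: `branches` are paths relative to `source`.
    Branches that do not exist are skipped.  Returns the branches copied.
    """
    copied = []
    for branch in branches:
        src = Path(source) / branch
        if not src.exists():
            continue
        dst = Path(backup) / branch
        if src.is_dir():
            # dirs_exist_ok lets repeated runs refresh an existing backup
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
        copied.append(branch)
    return copied
```

Called as mirror_branches(entry_dir, backup_dir, ["etc", "sys/tcp"]), it
leaves everything outside those branches out of the backup.  Note that this
sketch only adds and overwrites; it does not delete files that have
disappeared from the source.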

A simple "node_down node" command walks the net and uncatalogs the
unavailable node, replacing it with a link to its backup location, while a
corresponding "node_up node" command undoes this action (execution of these
commands is restricted).  In this way, we forget about what lives where;
once the backup locations are established, everything is handled by simply
switching the root-level entry for the unavailable node.
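
The switching idea can be sketched as follows.  This is a toy model in
Python, not the actual commands: the network root is represented as a
dictionary, and the names resolve, root, and backups are assumptions made
for the example.

```python
# Toy model (an assumption for illustration): the network root maps each
# entry name to ("node", name) for a cataloged node, or ("link", target)
# for a link to some other pathname.

def node_down(root, backups, node):
    """Replace the root entry for `node` with a link to its backup."""
    if node not in backups:
        raise KeyError("no backup location recorded for " + node)
    root[node] = ("link", backups[node])


def node_up(root, node):
    """Undo node_down: catalog the node under its own name again."""
    root[node] = ("node", node)


def resolve(root, path):
    """Resolve a //name/rest pathname through the root entry for name."""
    name, _, rest = path.lstrip("/").partition("/")
    kind, target = root[name]
    base = "//" + name if kind == "node" else target
    return base + ("/" + rest if rest else "")
```

With root = {"alpha": ("node", "alpha")} and a backup recorded as
"//beta/backup/alpha", calling node_down(root, backups, "alpha") makes
//alpha/etc/passwd resolve under //beta/backup/alpha, and node_up restores
the original resolution.  Nothing that refers to //alpha needs to change.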

Second, backup node_data directories are provided to preserve the node
environment when a diskless node's paging partner is down.  Each node has
a primary and a secondary paging partner.  The secondary is maintained
with only the critical files (e.g., startup?*, etc?*, tcp info, ... and
any user-specific files), with all non-essential files removed
periodically.  This reduces
the backup size by more than an order of magnitude.
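
The periodic pruning amounts to a keep/drop decision per file.  A minimal
sketch in Python, assuming the critical files can be described by simple
glob patterns; the patterns below are Python fnmatch globs standing in for
the wildcards above, and split_critical is a made-up name.

```python
from fnmatch import fnmatchcase

# Assumed stand-ins for the critical-file wildcards; extend per site.
CRITICAL_PATTERNS = ["startup*", "etc*"]


def split_critical(names, patterns=CRITICAL_PATTERNS):
    """Split file names into (keep, drop) lists, preserving input order.

    A name is kept if it matches at least one critical pattern.
    """
    keep, drop = [], []
    for name in names:
        if any(fnmatchcase(name, p) for p in patterns):
            keep.append(name)
        else:
            drop.append(name)
    return keep, drop
```

A pruning script would then remove everything in the drop list from the
secondary node_data directory and leave the keep list alone.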

All of the necessary syncing is done automatically by scripts running
under /etc/cron.  User logins, however, are left to the user.

While far from the elegance of replicated objects, this has provided a
reasonably stable environment during single node failures, and has been
straightforward to maintain.  Contact me if you would like more detail.

Mike Zeleznik
University of Utah
Zeleznik@Utah-20.ARPA

-------