Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!zephyr.ens.tek.com!tektronix!sequent!sweiger
From: sweiger@sequent.UUCP (Mark Sweiger)
Newsgroups: comp.databases
Subject: Re: Fault Tolerance vs. High Availability
Message-ID: <40213@sequent.UUCP>
Date: 7 Aug 90 22:19:10 GMT
References: <2060004@hpcuhc.HP.COM>
Reply-To: sweiger@crg3.UUCP (Mark Sweiger)
Organization: Sequent Computer Systems, Inc
Lines: 50

Tandem computer systems have achieved the laudable goal of both
hardware and software fault-tolerance, meaning that a transaction
will continue in the face of any *single* point of failure, whether
that failure is in hardware or in software.  Stratus computers
achieve (at least) hardware fault tolerance by replicating hardware
so that any given hardware component can fail without affecting uptime
and (presumably) throughput.  (It is not clear to me what kind
of software fault tolerance Stratus RDBMS offerings possess.
Can some type of hardware failure (power failure?) 
cause an in-progress transaction to be aborted, 
despite the redundant hardware?  Does the hardware fault tolerance
preclude the need for some software fault tolerance?  That is hard to
believe: what happens, for example, if a disk fails in the middle of a
transaction's multi-page disk write?)  Finally, Sequoia offers a
fault-tolerant Unix implementation.  I don't know much about this one at
all; does anyone out there have a thumbnail description?  How
is fault tolerance divided between hardware and software?

One interesting observation about the different approaches
to fault tolerance is that Tandem's fault-tolerant 
implementation seems much more software-based than the Stratus 
implementation.  Tandem's NonStop systems typically have
only one additional component of each type (with the exception of disk
drives), and the fault tolerance is built mostly into the software.
Stratus, on the other hand, has fully redundant hardware throughout.
Fully redundant hardware can apparently substitute, up to a point,
for a more difficult (Tandem-style) fault-tolerant software
implementation, but if the power fails, you still
need at least logging and recovery software to maintain
data integrity.  Given that, does hardware redundancy really
provide something more akin to high availability than to
non-stop fault tolerance?
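
To make the logging-and-recovery point concrete, here is a minimal
sketch of a redo-only write-ahead log, with a crash simulated in the
middle of a multi-page write.  It is not how Tandem, Stratus, or any
particular RDBMS does it; TinyStore, CrashError, and the crash_after
parameter are all made-up names, and the log is simply assumed to
survive the power failure.

```python
# Hypothetical redo-only write-ahead log; every name here is illustrative.

class CrashError(Exception):
    """Simulates a power failure in the middle of a multi-page write."""


class TinyStore:
    def __init__(self):
        self.pages = {}   # "disk" pages: page number -> contents
        self.log = []     # log records, assumed to survive the crash

    def commit(self, txn_id, writes, crash_after=None):
        # 1. Log every intended page image before touching any page.
        for page_no, data in writes.items():
            self.log.append(("redo", txn_id, page_no, data))
        # 2. The commit record marks the transaction as durable.
        self.log.append(("commit", txn_id))
        # 3. Apply the pages; a simulated crash may interrupt this step.
        for count, (page_no, data) in enumerate(writes.items()):
            if crash_after is not None and count == crash_after:
                raise CrashError(f"power lost after {count} page(s)")
            self.pages[page_no] = data

    def recover(self):
        # Replay redo records of committed transactions, in log order.
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in self.log:
            if rec[0] == "redo" and rec[1] in committed:
                _, _, page_no, data = rec
                self.pages[page_no] = data


def demo():
    store = TinyStore()
    store.commit("t0", {1: "old-a", 2: "old-b", 3: "old-c"})
    try:
        # Power fails after one of the three pages has been written.
        store.commit("t1", {1: "new-a", 2: "new-b", 3: "new-c"},
                     crash_after=1)
    except CrashError:
        pass
    torn = dict(store.pages)   # mixed old/new pages before recovery
    store.recover()
    return torn, dict(store.pages)
```

Calling demo() returns the torn page set left by the crash and the
page set after recovery.  Redundant hardware does nothing about the
torn state by itself; it is the log replay that repairs it.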

And then there is the claim of high availability, made by many
vendors, especially those without fault tolerance.  What does high
availability mean?  From what I have been able to figure out,
high availability means non-redundant hardware components like
memory, CPU, and bus with very long mean times between failures,
plus dual-ported disk drives with mirroring capability and an RDBMS
subsystem with logging and recovery.  Some vendors also offer battery-backed-up
memory.  (What happens to these systems when power fails and
transaction writes are in progress?  Can RDBMS recovery really be
avoided on a warm restart?  Seems unlikely.  If recovery can't
be avoided, what good is battery backup?)   Are other features
required for high availability?
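
One rough way to put numbers on "very long mean times between
failure" is the usual steady-state formula, availability =
MTBF / (MTBF + MTTR), with non-redundant components multiplied
together and a mirrored pair treated as failing only when both halves
are down.  The sketch below assumes independent failures, and every
MTBF and MTTR figure in it is invented for illustration, not taken
from any vendor.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*parts):
    """Non-redundant components: the system is up only if all are up."""
    result = 1.0
    for a in parts:
        result *= a
    return result


def mirrored(a):
    """Mirrored pair: down only if both copies are down (independence assumed)."""
    return 1.0 - (1.0 - a) ** 2


# Invented figures, in hours, purely for illustration.
cpu = availability(10_000, 4)
memory = availability(20_000, 4)
bus = availability(50_000, 4)
disk = availability(30_000, 8)

system_plain = series(cpu, memory, bus, disk)
system_mirrored = series(cpu, memory, bus, mirrored(disk))
```

Under these assumptions mirroring the disk raises the system figure,
but the result is still capped by the non-redundant CPU, memory, and
bus, which is why those parts need the long mean times between
failure.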
-- 
Mark Sweiger			Sequent Computer Systems
Database Software Engineer	15450 SW Koll Parkway
				Beaverton, Oregon  97006-6063
(503)526-4329			...{tektronix,ogicse}!sequent!sweiger