Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!zephyr.ens.tek.com!tektronix!sequent!sweiger
From: sweiger@sequent.UUCP (Mark Sweiger)
Newsgroups: comp.databases
Subject: Re: Fault Tolerance vs. High Availability
Message-ID: <40213@sequent.UUCP>
Date: 7 Aug 90 22:19:10 GMT
References: <2060004@hpcuhc.HP.COM>
Reply-To: sweiger@crg3.UUCP (Mark Sweiger)
Organization: Sequent Computer Systems, Inc
Lines: 50

Tandem computer systems have achieved the laudable goal of both
hardware and software fault tolerance, meaning that a transaction
will continue in the face of any *single* point of failure, whether
that failure is in hardware or in software.

Stratus computers achieve (at least) hardware fault tolerance by
replicating hardware so that any given hardware component can fail
without affecting uptime and (presumably) throughput.  (It is not
clear to me what kind of software fault tolerance Stratus RDBMS
offerings possess.  Can some type of hardware failure (power failure?)
cause an in-progress transaction to be aborted, despite the redundant
hardware?  Does the hardware fault tolerance preclude the need for
some software fault tolerance?  Hard to believe it does: what happens
if a disk fails in the middle of a transaction's multi-page disk
write, for example?)

Finally, Sequoia offers a fault-tolerant Unix implementation.  I don't
know much about this one at all; does anyone out there have a
thumbnail description?  How is fault tolerance spread between
hardware and software?

One interesting observation about the different approaches to fault
tolerance is that Tandem's fault-tolerant implementation seems much
more software-based than the Stratus implementation.  Tandem's
Non-Stop systems typically have only one additional component of each
type (with the exception of disk drives), and the fault tolerance is
built mostly in the software.  Stratus, on the other hand, has fully
redundant hardware throughout.
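
To make the multi-page write question above concrete, here is a
minimal sketch in Python of redo logging.  It is a toy model under my
own assumptions, not a description of how Tandem, Stratus, or any
RDBMS actually does it; ToyStore, PowerFailure, and the crash_after
parameter are all hypothetical names invented for the illustration.

```python
# Hypothetical toy model: a "disk" of pages plus a redo log, both plain
# Python objects standing in for stable storage. Not any vendor's design.

class PowerFailure(Exception):
    """Simulated crash in the middle of a multi-page write."""


class ToyStore:
    def __init__(self, pages):
        self.pages = dict(pages)  # assumed to survive a crash (the disk)
        self.log = []             # assumed to survive a crash (the log)

    def commit(self, txid, updates, crash_after=None):
        # 1. Write-ahead: log every intended page value, then a commit mark.
        for page, value in updates.items():
            self.log.append(("redo", txid, page, value))
        self.log.append(("commit", txid))
        # 2. Apply the pages in place; a crash here leaves a torn update.
        for n, (page, value) in enumerate(updates.items()):
            if crash_after is not None and n == crash_after:
                raise PowerFailure(txid)
            self.pages[page] = value

    def recover(self):
        # Replay redo records of committed transactions only.
        # Replaying twice gives the same result (idempotent).
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in self.log:
            if rec[0] == "redo" and rec[1] in committed:
                _, _, page, value = rec
                self.pages[page] = value


store = ToyStore({"A": 100, "B": 0})
try:
    # Move 30 from page A to page B; crash after the first page is written.
    store.commit("t1", {"A": 70, "B": 30}, crash_after=1)
except PowerFailure:
    pass
torn = dict(store.pages)       # page A updated, page B not
store.recover()
recovered = dict(store.pages)  # both pages reflect the committed values
```

In this sketch the crash leaves page A written and page B stale, and
only the replay of the log brings the two back into agreement.
Redundant hardware can make the crash less likely, but it does not by
itself say what the second page should have held.
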
It seems that fully redundant hardware can substitute, up to a point,
for a more difficult (Tandem) fault-tolerant software implementation,
but if the power fails, you still need at least logging and recovery
software to maintain data integrity.  Given that, does hardware
redundancy really give something more akin to high availability,
rather than non-stop fault tolerance?

And then there is the claim of high availability, made by many
vendors, especially those without fault tolerance.  What does high
availability mean?  From what I have been able to figure out, high
availability means non-redundant hardware components like memory, CPU,
and bus with very long mean times between failures, plus dual-ported
disk drives with mirroring capability and some RDBMS subsystem with
logging and recovery.  Some vendors offer battery-backed-up memory.
(What happens to these systems when power fails and transaction writes
are in progress?  Can RDBMS recovery really be avoided upon warm
recovery?  Seems unlikely.  If recovery can't be avoided, what good is
battery backup?)  Are other features required for high availability?
-- 
Mark Sweiger                    Sequent Computer Systems
Database Software Engineer      15450 SW Koll Parkway
                                Beaverton, Oregon 97006-6063
(503)526-4329                   ...{tektronix,ogicse}!sequent!sweiger