Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!usc!brutus.cs.uiuc.edu!apple!bionet!ames!pacbell!tandem!jimbo
From: jimbo@tandem.com (Jim Lyon)
Newsgroups: comp.arch
Subject: Re: Fault Tolerance [LONG]
Message-ID: <1990Feb2.035201.21073@tandem.com>
Date: 2 Feb 90 03:52:01 GMT
References: <13910004@hpisod2.HP.COM> <13910009@hpisod2.HP.COM> <35300@mips.mips.COM>
Reply-To: jimbo@tandem (Jim Lyon)
Organization: Tandem Computers, Inc.
Lines: 157

In article <35300@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>My problem is that it WASN'T about technology, and we ought to turn
>it back into a technology discussion that might be useful:
>	a) Various people do fault-tolerance various ways.  How about
>	people who know posting some things to explain how they work,
>	and what the strengths and weaknesses of the various ways are?

Now that the discussion is back to technology, I'll be happy to put in
my two cents' worth.  The following is NOT to be taken as a
pronouncement on Tandem strategy, about which I know relatively little
(and am willing to say even less).  In general, the following
represents merely the opinions of one lowly grunt (me).

Before talking too much about fault tolerance, it is important to know
a little bit about a fault model.  In general, most people think of a
fault in a component causing one of the following two behaviors:
    a) It stops dead.
or  b) It goes insane.

The latter case is VERY difficult to deal with.  People put in a great
deal of work to try to translate it into the first case.  TMR (triple
modular redundancy) schemes try to shoot the insane processor before it
manages to poison the outside world.  At higher levels, software is
very frequently full of tests for violated assertions (which are
evidence of insanity), in an attempt to kill the software component
before the insanity spreads.  In the latter case, one is not always
100% successful.

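As a rough illustration of turning "goes insane" into "stops dead" with
assertion checks, here is a minimal sketch.  The BoundedQueue component,
its invariant, and the FailStop exception are all hypothetical names
invented for this example; they do not describe any Tandem software.

```python
class FailStop(Exception):
    """Raised when a component detects that its own state is corrupt."""


class BoundedQueue:
    """Hypothetical component that keeps a redundant item count."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.count = 0          # redundant with len(self.items)
        self.halted = False

    def _check(self):
        # The assertion: the redundant count must agree with the list.
        ok = self.count == len(self.items) and 0 <= self.count <= self.capacity
        if not ok:
            # Evidence of insanity: stop dead rather than keep running.
            self.halted = True
            raise FailStop("queue invariant violated; halting component")

    def put(self, item):
        if self.halted:
            raise FailStop("component is halted")
        self._check()
        if self.count == self.capacity:
            raise OverflowError("queue full")
        self.items.append(item)
        self.count += 1


q = BoundedQueue(4)
q.put("a")
q.count = 3                     # simulate a corrupted data structure
try:
    q.put("b")
    outcome = "accepted"
except FailStop:
    outcome = "halted"
print(outcome)
```

The check only catches corruption that happens to violate the chosen
invariant, which is one reason this approach is not always successful.
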
Hardware faults are frequently classed as either transient (a cosmic
ray flipped a bit in memory) or hard (a transistor is broken and a bit
in memory will always return zero).

Software faults are harder.  They are frequently classed as either Bohr
bugs or Heisenbugs.  A Bohr bug is a deterministic bug (every time I
try to run hack, the system fails).  A Heisenbug, on the other hand, is
a nondeterministic bug (if an interrupt occurs in a particularly
sensitive part of code, we'll corrupt a data structure so that the next
user of the data structure will die).  This breakdown also applies to
hardware, but people don't usually talk about it in that context (I
don't know why).

A good QA process will catch nearly 100% of the Bohr bugs and many, but
by no means all, of the Heisenbugs in a product.  It is realistic to
expect that a released product has more Heisenbugs left than Bohr bugs.

These days, a typical complex system is built in lots of layers.  The
reliability of the system is the product of the reliabilities of each
of the layers [e.g., hardware, microcode, operating system, database,
application, etc.].  In a normal world, a failure in one layer will
cause the immediately higher layer to either:
    a) notice the failure and correct for it, or
    b) notice the failure and throw up its hands (because it doesn't
       know how to correct for it), or
    c) fail to notice the failure, thereby failing itself.

In case (a), that's the end of it.  Your system has successfully
tolerated a fault.  In cases (b) and (c), the failure in one layer has
just been translated to the failure of the next layer up, and we need
to repeat.  Notice that in most systems, the uppermost layer is a
person (liveware?).

So, a failure at a low layer propagates up and up, until we find a
layer smart enough to deal with the failure.  However, not every
failure starts at the bottom.  Operating systems fail.  Device drivers
have bugs.  Database managers have bugs.  Applications have bugs.

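To make the "product of the reliabilities" remark concrete, here is a
small worked sketch.  The per-layer numbers are invented purely for
illustration, and the calculation assumes that layers fail
independently and that no layer masks failures in the one below it.

```python
from math import prod

# Hypothetical probability that each layer gets through some fixed
# period without failing.  These figures are made up for the example.
layers = {
    "hardware": 0.999,
    "microcode": 0.9995,
    "operating system": 0.995,
    "database": 0.99,
    "application": 0.98,
}


def system_reliability(reliabilities):
    """Multiply the layer reliabilities (assumes independent failures)."""
    return prod(reliabilities)


total = system_reliability(layers.values())
weakest = min(layers, key=layers.get)
print(f"system reliability: {total:.4f}")
print(f"weakest layer: {weakest}")
```

Under these assumptions the product is always lower than the
reliability of the weakest single layer, which is why making only the
bottom layer more reliable helps less than one might hope.
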
All of this having been said, the question still remains: Why is
checkpointing good/bad?

The prime virtue of checkpointing is that you can do it again at each
layer.  Conceptually, you introduce a new layer between each of your
previous layers.  Where before, layer 3 made direct requests of layer
2, now we have layer 3 use a new layer 2a.  Layer 2a maintains
interfaces to two (or more) instances of layer 2.  Should one instance
of layer 2 fail, layer 2a transparently redirects requests to the other
instance of layer 2.  The client, layer 3, never sees the failure.

Of course, this gives rise to a couple of requirements:
    a) All of the requests to a replicated layer must be idempotent.
       If I ask an instance of layer 2 to debit my bank account by
       $100, and it fails after doing so but before reporting success,
       I don't want the other instance to debit another $100.  There
       are well-known schemes (using sequence numbers) to turn
       non-idempotent requests into idempotent ones.
    b) If a layer maintains state about its clients, this state needs
       to be kept synchronized among the various instances of that
       layer.  Typically, they do this by informing each other
       whenever they change their state.  This is what we call a
       checkpoint.

If we do this replication and checkpointing at every layer of the
system, we can achieve very high reliability.  It turns out that we can
mask all of the single hardware failures (both transient and hard), as
well as most of the software Heisenbugs.  This technique does not mask
the Bohr bugs; if a layer contains a Bohr bug such that a certain
request causes an instance of that layer to fail, then each instance of
that layer will end up failing, one at a time.  [One of these days I'll
send something to alt.computers.folklore about the bug that caused 34
CPUs to fail sequentially, at 4-second intervals.]

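The two requirements above (idempotent requests via sequence numbers,
and checkpoints between instances) can be sketched in a few lines.
This is a toy model under stated assumptions, not Tandem's actual
mechanism: AccountServer, Layer2a, and the crash_before_reply switch
are hypothetical names invented here, and the "checkpoint" is just a
direct method call on the peer.

```python
class AccountServer:
    """One instance of 'layer 2' (hypothetical); holds client state."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.balance = 500
        self.done = {}          # sequence number -> reply already given
        self.peer = None

    def apply_checkpoint(self, seq, balance, reply):
        # The peer tells us about its state change.
        self.balance = balance
        self.done[seq] = reply

    def debit(self, seq, amount, crash_before_reply=False):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        if seq in self.done:    # duplicate request: replay the old reply
            return self.done[seq]
        self.balance -= amount
        reply = self.balance
        self.done[seq] = reply
        if self.peer is not None and self.peer.alive:
            self.peer.apply_checkpoint(seq, self.balance, reply)
        if crash_before_reply:  # simulated fault after the work is done
            self.alive = False
            raise ConnectionError(self.name + " failed before replying")
        return reply


class Layer2a:
    """Stand-in for 'layer 2a': hides an instance failure from layer 3."""

    def __init__(self, primary, backup):
        self.instances = [primary, backup]
        primary.peer, backup.peer = backup, primary
        self.next_seq = 0

    def debit(self, amount, crash_before_reply=False):
        self.next_seq += 1
        seq = self.next_seq     # the same number is reused on every retry
        for inst in self.instances:
            try:
                return inst.debit(seq, amount, crash_before_reply)
            except ConnectionError:
                crash_before_reply = False   # inject the fault only once
        raise RuntimeError("all instances failed")


primary, backup = AccountServer("A"), AccountServer("B")
layer = Layer2a(primary, backup)
first = layer.debit(100)
second = layer.debit(100, crash_before_reply=True)
print(first, second, backup.balance)
```

In the second call the primary does the debit, checkpoints, and then
dies before replying.  The retry reaches the backup with the same
sequence number, so the backup replays the recorded reply instead of
debiting a second time.  A request that deterministically crashes
whichever instance handles it (a Bohr bug) would take down both
instances in turn, which this sketch does nothing to prevent.
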
In summary, checkpointing not only allows you to survive most of your
hardware failures, but also most of your operating system bugs, most of
your database manager bugs, most of your communication protocol bugs,
most of your transaction manager bugs, and even many of your
application bugs.

So, why doesn't everybody use checkpointing, all of the time?  In
particular, why didn't Tandem use checkpointing in the S2?  Well, ...
    1) It's hard.  No doubt about it, it's frequently twice as hard to
       design a piece of software with checkpointing as it is without
       it.
    2) If you already have a piece of software that was designed
       without checkpointing, it's VERY hard to add it as an
       afterthought.
    3) If you insist on retrofitting checkpointing into something that
       wasn't designed with it in mind, you are likely to see VERY
       poor performance.

Remember that the Tandem S2's mission in life is to run Unix.  If you
want a machine to run Unix, you don't do checkpointing.  If you really
wanted to, you could, with a huge amount of work, put checkpointing
into the Unix kernel.  You couldn't, even if you wanted to, manage to
put checkpointing into any significant fraction of the third-party
software (like database managers, bizarre comm managers, applications,
etc.).  So, the amount of reliability that you could add via
checkpointing is exactly that you could mask some of the Heisenbugs in
the kernel.  There just aren't that many there.

So, what DO you do if you want a high-reliability Unix system?  You:
    a) Use TMR on the processor and memory.  We've just tolerated all
       of the single faults from these components.
    b) Duplex the disks.  We're now in a position to tolerate hard
       disk errors.
    c) Beef up the device drivers.  A large number of the panics that
       Unix systems experience are directly traceable to a transient
       error of one sort or another on a device.  Put in code to
       recover from these errors.  Use some aggressive test strategies
       to make sure that this code actually works.
    d) Clean up a small number of other places where the kernel just
       gives up (primarily due to resource exhaustion).

The result is a system which:
    a) Isn't perfect.
    b) Will fail far less often than a conventional Unix system.

SUMMARY:
If you want the highest degree of fault tolerance possible, design it
from the start to use checkpointing [If you come work for Tandem for a
few years, you'll learn how].  If you can't design it [or redesign it]
from the start, don't use checkpointing.  Depending on where the real
reasons for failure are, you may or may not benefit from running it on
a system that uses checkpointing at a lower level.

I hope this has been informative and hasn't sounded too much like a
Tandem commercial.  If not, well, I'll put on my asbestos suit now.

-- Jim Lyon -- Tandem Computers -- jimbo@tandem.com