Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!usc!brutus.cs.uiuc.edu!apple!bionet!ames!pacbell!tandem!jimbo
From: jimbo@tandem.com (Jim Lyon)
Newsgroups: comp.arch
Subject: Re: Fault Tolerance [LONG]
Message-ID: <1990Feb2.035201.21073@tandem.com>
Date: 2 Feb 90 03:52:01 GMT
References: <13910004@hpisod2.HP.COM> <13910009@hpisod2.HP.COM> <35300@mips.mips.COM>
Reply-To: jimbo@tandem (Jim Lyon)
Organization: Tandem Computers, Inc.
Lines: 157

In article <35300@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>My problem is that it WASN'T about technology, and we ought to turn
>it back into a technology discussion that might be useful:
>        a) Various people do fault-tolerance various ways.  How about
>        people who know posting some things to explain how they work,
>        and what the strengths and weaknesses of the various ways are?

Now that the discussion is back to technology, I'll be happy to put
in my two cents' worth.  The following is NOT to be taken as a pronouncement
on Tandem strategy, about which I know relatively little (and am willing
to say even less).  In general, the following represents merely the
opinions of one lowly grunt (me).

Before talking too much about fault tolerance, it is important to know
a little bit about a fault model.  In general, most people think of a
fault in a component causing one of the following two behaviors:
a)  It stops dead.  or
b)  It goes insane.
The latter case is VERY difficult to deal with.  People put a great
deal of work into translating it into the first case.  TMR schemes
try to shoot the insane processor before it manages to poison the
outside world.  At higher levels, software is very frequently full
of tests for violated assertions (which is evidence of insanity), in
an attempt to kill the software component before the insanity spreads.
In the software case, one is not always 100% successful.
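
A minimal sketch of the voting idea behind TMR, in Python.  This is
an illustration only, not a description of any real TMR hardware; the
name tmr_vote and the idea of returning the indices of the dissenting
replicas are assumptions made for the example.

```python
from collections import Counter

def tmr_vote(outputs):
    """Majority-vote three replica outputs.

    Returns (agreed_value, dissenters), where dissenters lists the
    indices of replicas whose output disagreed with the majority.
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # All three disagree: more than a single fault, nothing to trust.
        raise RuntimeError("no majority among replicas")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters

# Replica 2 has gone insane; the voter masks it and names it.
value, to_shoot = tmr_vote([42, 42, 17])
```

The point of returning the dissenters is that the insane replica can
be shut off (turned into the "stops dead" case) before its output
reaches the outside world.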

Hardware faults are frequently classed as either transient (a cosmic
ray flipped a bit in memory) or hard (a transistor is broken and a
bit in memory will always return zero).

Software faults are harder.  They are frequently classed as either
Bohr bugs or Heisenbugs.  A Bohr bug is a deterministic bug (every
time I try to run hack, the system fails).  A Heisenbug, on the other
hand, is a nondeterministic bug (if an interrupt occurs in a
particularly sensitive part of code, we'll corrupt a data structure
so that the next user of the data structure will die).  This breakdown
also applies to hardware, but people don't usually talk about it in
that context (I don't know why).

A good QA process will catch nearly 100% of the Bohr bugs and many,
but by no means all, of the Heisenbugs in a product.  It is realistic
to expect that a released product will have more Heisenbugs left than
Bohr bugs.


These days, a typical complex system is built in lots of layers.  The
reliability of the system is the product of the reliabilities of each
of the layers [e.g., hardware, microcode, operating system, database,
application, etc].  In a normal world, a failure in one layer will
cause the immediately higher layer to either:
a) notice the failure and correct for it, or
b) notice the failure and throw up its hands (because it doesn't
   know how to correct for it), or
c) fail to notice the failure, thereby failing itself.
In case (a), that's the end of it.  Your system has successfully tolerated
a fault.  In cases (b) and (c), the failure in one layer has just been
translated to the failure of the next layer up, and we need to repeat.
Notice that in most systems, the uppermost layer is a person (liveware?).
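
To put numbers on the claim that system reliability is the product of
the layer reliabilities, here is a small Python sketch.  It assumes the
layers fail independently, and every figure in it is made up purely
for illustration.

```python
import math

def system_reliability(layer_reliabilities):
    """Probability that every layer works, assuming independent layers."""
    return math.prod(layer_reliabilities)

# Made-up figures, purely for illustration.
layers = {
    "hardware": 0.999,
    "microcode": 0.999,
    "operating system": 0.99,
    "database": 0.99,
    "application": 0.95,
}
overall = system_reliability(layers.values())
```

With these invented numbers the overall figure comes out at about
0.93, which is worse than the weakest single layer.  An unmasked
failure anywhere in the stack counts against the whole system.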

So, a failure at a low layer propagates up and up, until we find a layer
smart enough to deal with the failure.  However, not every failure starts
at the bottom.  Operating systems fail.  Device drivers have bugs.  Database
managers have bugs.  Applications have bugs.

All of this having been said, the question still remains:
  Why is checkpointing good/bad?

The prime virtue of checkpointing is that you can do it again at each layer.
Conceptually, you introduce a new layer between each of your previous layers.
Where before, layer 3 made direct requests of layer 2, now we have layer
3 use a new layer 2a.  Layer 2a maintains interfaces to two (or more)
instances of layer 2.  Should one instance of layer 2 fail, layer 2a
transparently redirects requests to the other instance of layer 2.  The
client, layer 3, never sees the failure.
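
The "layer 2a" idea can be sketched in a few lines of Python.  This is
a toy under stated assumptions: instances of layer 2 are modelled as
plain callables, a stopped instance is modelled by a hypothetical
InstanceFailed exception, and the class name Layer2a is borrowed from
the text above.  It ignores state and duplicate requests.

```python
class InstanceFailed(Exception):
    """Raised when an instance of layer 2 has stopped dead."""

class Layer2a:
    """Sits between layer 3 and several instances of layer 2."""

    def __init__(self, instances):
        self.instances = list(instances)

    def request(self, *args):
        # Try the current instance; on failure, drop it and redirect
        # the same request to the next one.
        while self.instances:
            try:
                return self.instances[0](*args)
            except InstanceFailed:
                self.instances.pop(0)
        raise InstanceFailed("all instances of layer 2 have failed")

def dead_instance(x):
    raise InstanceFailed("stopped dead")

def live_instance(x):
    return x * 2

# Layer 3 only ever talks to layer 2a and never sees the failure.
layer2a = Layer2a([dead_instance, live_instance])
answer = layer2a.request(21)
```

Blindly re-sending the request like this is only safe if the request
can be repeated without harm.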

Of course, this gives rise to a couple of requirements:
a)  All of the requests to a replicated layer must be idempotent.  If I ask
    an instance of layer 2 to debit my bank account by $100, and it fails
    after doing so but before reporting success, I don't want the other
    instance to debit another $100.  There are well-known schemes (using
    sequence numbers) to turn non-idempotent requests into idempotent ones.
b)  If a layer maintains state about its clients, this state needs to be
    kept synchronized among the various instances of that layer.  Typically,
    they do this by informing each other whenever they change their state.
    This is what we call a checkpoint.
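
Both requirements can be shown together in a small Python sketch.
It is a toy, not how any Tandem product does it: the class
AccountReplica, the method names, and the choice to checkpoint to the
peers before applying the change locally are all assumptions made for
the example.

```python
class AccountReplica:
    """One instance of a layer that holds an account balance."""

    def __init__(self, balance):
        self.balance = balance
        self.replies = {}   # (client_id, seq) -> balance after request
        self.peers = []     # other instances of this layer

    def apply_checkpoint(self, key, balance):
        # Record that request `key` is done and what state resulted.
        self.replies[key] = balance
        self.balance = balance

    def debit(self, client_id, seq, amount):
        key = (client_id, seq)
        if key in self.replies:
            # Duplicate request: already done, just repeat the reply.
            return self.replies[key]
        new_balance = self.balance - amount
        for peer in self.peers:
            peer.apply_checkpoint(key, new_balance)   # the checkpoint
        self.apply_checkpoint(key, new_balance)
        return new_balance

primary = AccountReplica(500)
backup = AccountReplica(500)
primary.peers = [backup]

# The primary does the debit but "dies" before the client hears back.
primary.debit("jim", 1, 100)
# The client retries the SAME sequence number against the backup.
retry_reply = backup.debit("jim", 1, 100)
```

Because the checkpoint carried the sequence number along with the new
balance, the backup recognises the retry and the account is debited
once, not twice.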

If we do this replication and checkpointing at every layer of the system,
we can achieve very high reliability.  It turns out that we can mask
all of the single hardware failures (both transient and hard), as well
as most of the software Heisenbugs.  This technique does not mask the
Bohr bugs; if a layer contains a Bohr bug such that a certain request
causes an instance of that layer to fail, then each instance of that
layer will end up failing, one at a time.  [One of these days I'll send
something to alt.computers.folklore about the bug that caused 34 CPUs
to fail sequentially, at 4-second intervals.]
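
The difference between the two kinds of bug under failover can be
seen in a toy Python model.  Everything here is hypothetical: the
function names are invented, a Bohr bug is modelled as "always fails
on this request", and a Heisenbug as "fails only in the one instance
whose data structure happened to get corrupted".

```python
def run_with_failover(handlers, request):
    """Try each instance in turn; return (result, instances_lost)."""
    lost = 0
    for handler in handlers:
        try:
            return handler(request), lost
        except RuntimeError:
            lost += 1
    return None, lost

def bohr_handler(request):
    # Deterministic: this request kills every instance that sees it.
    if request == "hack":
        raise RuntimeError("Bohr bug")
    return "ok"

def make_heisen_handler(corrupted):
    # Nondeterministic in real life; here the bad luck is a flag.
    def handler(request):
        if corrupted:
            raise RuntimeError("Heisenbug")
        return "ok"
    return handler

# The Heisenbug costs one instance and the request still succeeds.
heisen = run_with_failover(
    [make_heisen_handler(True), make_heisen_handler(False)], "hack")
# The Bohr bug marches through all 34 instances, one after another.
bohr = run_with_failover([bohr_handler] * 34, "hack")
```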

In summary, checkpointing not only allows you to survive most of your
hardware failures, but also most of your operating system bugs, most
of your database manager bugs, most of your communication protocol bugs,
most of your transaction manager bugs, and even many of your application
bugs.

So, why doesn't everybody use checkpointing, all of the time?  In
particular, why didn't Tandem use checkpointing in the S2?

Well, ...
1)  It's hard.  No doubt about it, it's frequently twice as hard to design
    a piece of software with checkpointing as it is without it.
2)  If you already have a piece of software that was designed without
    checkpointing, it's VERY hard to add it as an afterthought.
3)  If you insist on retrofitting checkpointing into something that wasn't
    designed with it in mind, you are likely to see VERY poor performance.

Remember that the Tandem S2's mission in life is to run Unix.
If you want a machine to run Unix, you don't do checkpointing.  If
you really wanted to, you could, with a huge amount of work, put
checkpointing into the Unix kernel.  You couldn't, even if you wanted
to, manage to put checkpointing into any significant fraction of the
third-party software (like database managers, bizarre comm managers,
applications, etc.).  So, the only reliability that you could
add via checkpointing would come from masking some of the
Heisenbugs in the kernel.  There just aren't that many there.

So, what DO you do if you want a high-reliability Unix system?
You:
a)  Use TMR on the processor and memory.  We've just tolerated all of
    the single faults from these components.
b)  Duplex the disks.  We're now in a position to tolerate hard disk errors.
c)  Beef up the device drivers.  A large number of the panics that Unix
    systems experience are directly traceable to a transient error of
    one sort or another on a device.  Put in code to recover from these
    errors.  Use some aggressive test strategies to make sure that this
    code actually works.
d)  Clean up a small number of other places where the kernel just gives
    up (primarily due to resource exhaustion).
The result is a system which:
a)  Isn't perfect.
b)  Will fail far less often than a conventional Unix system.
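
Point (c) above, recovering from transient device errors and then
testing that the recovery code works, can be sketched in Python.
This is not kernel code and not how the S2 drivers are written; the
names read_with_retry, TransientDeviceError and FlakyDisk are invented
for the example, and FlakyDisk stands in for a fault-injection test
harness.

```python
class TransientDeviceError(Exception):
    """A device error that may go away if the operation is retried."""

def read_with_retry(read_block, block_no, attempts=3):
    """Retry a read a bounded number of times instead of giving up."""
    last_error = None
    for _ in range(attempts):
        try:
            return read_block(block_no)
        except TransientDeviceError as err:
            last_error = err
    # Still failing after every attempt: treat it as a hard error.
    raise last_error

class FlakyDisk:
    """Test double that injects a fixed number of transient errors."""

    def __init__(self, failures):
        self.failures = failures

    def read_block(self, block_no):
        if self.failures > 0:
            self.failures -= 1
            raise TransientDeviceError("injected fault")
        return f"block-{block_no}"

# Two injected faults are tolerated within three attempts.
data = read_with_retry(FlakyDisk(2).read_block, 7)
```

Injecting the faults on purpose is what lets you check that the
recovery path actually runs, rather than finding out in the field.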

SUMMARY:

If you want the highest degree of fault tolerance possible, design it
from the start to use checkpointing [If you come work for Tandem for
a few years, you'll learn how].

If you can't design it [or redesign it] from the start, don't use
checkpointing.  Depending on where the real reasons for failure are,
you may or may not benefit from running it on a system that uses
checkpointing at a lower level.

I hope this has been informative and hasn't sounded too much like a
Tandem commercial.  If not, well, I'll put on my asbestos suit now.

-- Jim Lyon
-- Tandem Computers
-- jimbo@tandem.com