Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!tut.cis.ohio-state.edu!pt.cs.cmu.edu!MATHOM.GANDALF.CS.CMU.EDU!lindsay
From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
Newsgroups: comp.arch
Subject: Re: Fault Tolerance
Message-ID: <7840@pt.cs.cmu.edu>
Date: 5 Feb 90 16:14:06 GMT
References: <13910004@hpisod2.HP.COM> <13910009@hpisod2.HP.COM> <35300@mips.mips.COM> <1990Feb2.035201.21073@tandem.com>
Organization: Carnegie-Mellon University, CS/RI
Lines: 46

In article <1990Feb2.035201.21073@tandem.com> jimbo@tandem (Jim Lyon) writes:
>TMR schemes
>try to shoot the insane processor before it manages to poison the
>outside world.

Ah, TMR?? Thread Maintenance and Repair??? Test, Monitor, Recovery???

"Shooting" of course implies that there is a way for healthy machines
to do things to insane systems. So, not only does one have "are you
alive" messages, one also has "die" messages. Does Tandem try to keep
insane machines from sending this message?

>In summary, checkpointing not only allows you to survive most of your
>hardware failures, but also most of your operating system bugs, most
>of your database manager bugs, most of your communication protocol bugs,
>most of your transaction manager bugs, and even many of your application
>bugs.

I'm impressed. That's quite a long list.

>If you can't design it [or redesign it] from the start, don't use
>checkpointing.

Do you hold out any hope for automation, or for schemes that trade
off efficiency for ease of retrofit?

>So, what DO you do if you want a high-reliability Unix system? You:
	[list of what are basically cleanups]

Yes, the press reported that Tandem's Unix had fixes in some 800
places where the kernel used to just throw up its hands. Obviously, a
lot of work has been put in.

What ever happened to the Auragen Unix kernel? They did checkpointing
between process pairs, and synchronized them at intervals. (Each Unix
signal caused a synch, because it had to interrupt both processes at
exactly the same instruction.)
Synchronization also involved paging out all dirty pages: certainly
an argument against the VAX, which doesn't know who's dirty. I
believe the Auragen people also pulled some kernel functions into
server processes, where it was easier to make them survive. This
makes the various kernelization projects (such as Mach) sound ever
more attractive.
--
Don		D.C.Lindsay 	Carnegie Mellon Computer Science