Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!mips!apple!motcsd!hpda!hpcupt1!hpisod2!dhepner
From: dhepner@hpisod2.HP.COM (Dan Hepner)
Newsgroups: comp.arch
Subject: Re: Fault Tolerant Micros
Message-ID: <13910010@hpisod2.HP.COM>
Date: 3 Feb 90 02:01:32 GMT
References: <13910004@hpisod2.HP.COM>
Organization: Hewlett Packard, Cupertino
Lines: 129

From: mash@mips.COM (John Mashey)
>
> a) Various people do fault-tolerance various ways. How about
> people who know posting some things to explain how they work,
> and what the strengths and weaknesses of the various ways are?

While various people do FT various ways, it is reasonable to decide
to discuss a restricted set. If the space shuttle needs five
computers, programmed by different companies using different
algorithms, that's fine, but would appear beyond the scope of how FT
is best done using microprocessors.

The ways cleanly divide into two: checkpointing vs. redundantly
executing instructions with enough processors to guarantee completion
regardless of any failure. All FT schemes should include some means
to detect failure, of course.

"The problem" which must be solved by any FT scheme is how to get
each instruction executed once and only once WRT the user.

Checkpointing solves this by saving enough process state in a place
available to another processor to restart the process in the event of
original processor failure. Any post checkpoint execution by the
failed processor is guaranteed to be abandoned.

Redundant processor machines always execute the instruction with more
than one processor (3 for Tandem, 4 for Stratus), and compare results.
Miscomparisons result in reliable detection of "crazy processors",
including cache, logic, or whatever.

> b) Of particular interest in this discussion: are there features
> in fault-tolerant OLTP systems that:

OLTP, while clearly correlated with FT, is a separate topic.
Tandem's Guardian line does seem to assume that the best reason one
might need FT is for OLTP, but other markets for FT exist as well.
Communications applications stick out, e.g. telecom.

>    a) are in UNIX
>    b) aren't in UNIX, but could be
>    c) aren't in UNIX, but would require complete rewrites
>       to get them there.

While there are a bunch of requests for UNIX enhancements from the
OLTP community, these are OLTP requests, not FT requests. There's
also a bunch of requests for UNIX enhancements from people who
advocate reliable OSs. Laudable no doubt, but again not FT requests.

Ideally FT would exist completely in the hardware, and present a
platform to the OS which looks like a non-FT machine. The reality is
that this can't be quite true. FT vendors will always be required to
supply whatever kernel support their idiosyncratic implementation
requires, and an OS port to such a machine will always be more
difficult than on a non-FT platform.

However, by and large the progress of UNIX, and the related progress
of DBMS software, should proceed without undue concern for the needs
of FT. The ball is in the other court. FT machines should be (and
are being) designed with the needs of UNIX and DBMS software in mind.

> d) What's the tradeoff between:
>    degree of software fault-tolerancy
>    and
>    ability to run standard software, with no changes

"Software fault tolerance" implies that _someone_ must figure out
when and what to checkpoint, or that an extreme penalty will be paid
because everything is checkpointed. How much standard software has
such checkpointing already programmed in? None, of course.

>Anyway, I'd observe that from the publicly-available data, it is clear that
>the 2 Tandem product lines don't really overlap very much, and are aimed
>at different markets, for different reasons, and hence, trying to read too
>much into this about the merit or lack thereof of a specific technical
>feature just doesn't make sense.

I guess we just come to different interpretations of publicly
available, and maybe ambiguous, information. It still looks to me
like a technological generation change. Time will certainly tell.

>the VAX meant that DEC thought PDP-11s were Wrong Things, of course I'd
>have objected. Unless my memory fails, I thought DEC made something like
>$1B last year on PDP-11-based products... 10 years after the introduction of
>the VAX. Although the two overlapped in some areas, they didn't at all in
>others.

I guess what is at the core of the disagreement over what is
appropriate for discussion is that many of the problems of changing
from older to newer technology are universal to all successful
companies, and discussing one instantiation doesn't seem directed at
the subject being discussed.

HP continues to sell "traditional" 16-bit HP-3000s, while having
moved to 32-bit RISC. Most observers speculated when they heard the
HP-PA announcement that HP was moving away from the 16-bit
architecture, and that HP believed it could do better than before.
HP surely never concluded that traditional HP-3000s were "wrong
things", nor of course will Tandem ever come to such a conclusion
over a product which is on par with the HP-3000 as a success both
commercially and technically.

Obsolescence does not imply wrongness. As shown by the PDP-11, it
doesn't imply lack of commercial success, and certainly doesn't imply
lack of support.

>As a matter of style, I believe that it is much better to carefully
>label speculations as such, and ask questions, than to make strong-sounding
>statements that can easily mislead the casual observer.

I'll accept your advice, and thank you for what is indeed a better
way of presenting the case. But I'll point out that anyone who is
misled by a strong-sounding statement in comp.arch is a candidate for
selling a bridge to.

>I have no desire to suppress such discussions, as the interactions of
>technology and business are extremely important to understand.

The problem is that companies, _all_ long-lived technology companies,
control technology for business reasons. Taking company
announcements at face value is not the way to understand such
interactions.

>-john mashey

Dan Hepner
DISCLAIMER: Not a statement of Hewlett-Packard Co.