Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!mcgill-vision!bloom-beacon!snorkelwacker!think!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!pt.cs.cmu.edu!MATHOM.GANDALF.CS.CMU.EDU!lindsay
From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
Newsgroups: comp.arch
Subject: Re: Fault Tolerance [LONG]
Message-ID: <7899@pt.cs.cmu.edu>
Date: 8 Feb 90 23:38:47 GMT
References: <1990Feb2.035201.21073@tandem.com> <13910014@hpisod2.HP.COM> <38@exodus.Eng.Sun.COM>
Organization: Carnegie-Mellon University, CS/RI
Lines: 40

In article <38@exodus.Eng.Sun.COM> 
	rtrauben@cortex.EBay.Sun.COM (Richard Trauben) writes:
>Dan Hepner responds to a thread about redundant mass-store and datacom
>requests wrt rolling back to a checkpoint after a PE-pair failure:
>>> The IO request atomicity can be addressed as part of the problem of 
>>> checkpoint atomicity. Once the atomic checkpoint mechanism is developed, 
>>> the initiation of IO requests can be incorporated, so that the initiation 
>>> of an IO request happens only at the time of a successful checkpoint.
>>> From the recovery processor's point of view, either the checkpoint/
>>> IO request happened or it didn't, and that is discernible.
>
>A consequence of what you suggest is that a unique checkpoint must 
>exist for every packet in a duplex conversation (over a link) where there
>are dependencies between talker and listener (debit/credit): as in 
>one checkpoint per TCP/IP or X.25 packet. 


The checkpointing systems that I'm aware of do not perform a
checkpoint on every IO. Instead, they treat IO as a form of message
traffic.  Whenever a process receives a message (does a read), a copy
of the message is also put in a special queue. When the process is
checkpointed, the queue is cleared, since a recovery would restart
from that checkpoint and replay only the messages queued after it.
So, yes, there is an overhead per application-level IO operation.
But, no, the overhead is not a complete checkpoint. In the case of a
read from a read-only file, I suppose that the "message" could be a
description of the read request, instead of being a copy of the
actual data.
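
A minimal sketch of the message-queue scheme described above, not
taken from any particular system. The class LoggedProcess, its method
names, and the "state is a running total" model are all hypothetical;
the saved checkpoint and the queue stand in for whatever storage
survives the failure (for example, a backup processor).

```python
import copy


class LoggedProcess:
    """Hypothetical process whose state is a running total of its inputs."""

    def __init__(self):
        self.state = 0
        # Assumed to survive a failure (e.g. held by a backup).
        self.saved_state = 0
        self.queue = []

    def _apply(self, msg):
        # Deterministic state change driven by one message.
        self.state += msg

    def receive(self, msg):
        # Per-IO overhead: copy the message into the queue, then act on it.
        self.queue.append(msg)
        self._apply(msg)

    def checkpoint(self):
        # A full checkpoint saves the state and clears the queue.
        self.saved_state = copy.deepcopy(self.state)
        self.queue.clear()

    def recover(self):
        # Restart from the last checkpoint and replay queued messages.
        self.state = copy.deepcopy(self.saved_state)
        for msg in self.queue:
            self._apply(msg)


p = LoggedProcess()
p.receive(5)
p.receive(7)
p.checkpoint()
p.receive(3)
p.state = None   # simulate losing the in-memory state
p.recover()
```

The sketch only works because _apply is deterministic: replaying the
same messages from the same checkpoint has to rebuild the same state.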

Reliability is never without a price, but the price can be a lot
lower in selected cases. For example: just ask the customer to try
again. Also, "end to end" is a more general concept than some people
seem to think. Suppose that a salesman sends orders to a central
system, but also keeps a copy in his local machine.  At intervals,
the salesman can have his machine prepare a summary, compress it, and
send it in when the telephone rates are low.  The central system can
use late-night cycles to check summaries against the online data.
This sort of lazy checksumming is really cheap, and _eventually_ the
files are as correct as any other method could get them.
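
The salesman example can be sketched as follows, under loud
assumptions: orders are modelled as a mapping from an order id to an
amount, the summary is compressed JSON, and the function names
make_summary and reconcile are invented for illustration. It compares
full records rather than a true checksum, which keeps the sketch short.

```python
import json
import zlib


def make_summary(local_orders):
    """Salesman side: compress a sorted dump of the locally kept orders."""
    payload = json.dumps(sorted(local_orders.items())).encode("utf-8")
    return zlib.compress(payload)


def reconcile(summary, central_orders):
    """Central side: compare a received summary against the online data.

    Returns (missing, mismatched): orders the central system never saw,
    and orders whose amounts disagree, as (local, central) pairs.
    """
    claimed = dict(json.loads(zlib.decompress(summary).decode("utf-8")))
    missing = {k: v for k, v in claimed.items() if k not in central_orders}
    mismatched = {
        k: (v, central_orders[k])
        for k, v in claimed.items()
        if k in central_orders and central_orders[k] != v
    }
    return missing, mismatched


# Hypothetical data: one order was lost in transit, one was garbled.
local = {"A1": 100, "A2": 250, "A3": 75}
central = {"A1": 100, "A2": 205}

missing, mismatched = reconcile(make_summary(local), central)
```

Nothing here has to run at transaction time; both halves can wait for
cheap telephone rates and idle late-night cycles.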

-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science