Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!daemon
From: thomson@hub.toronto.edu (Brian Thomson)
Newsgroups: comp.protocols.tcp-ip
Subject: Re: more on Fletcher
Message-ID: <8912071916.AA23996@beaches.hub.toronto.edu>
Date: 7 Dec 89 19:18:33 GMT
References: <8912060542.AA05854@WLV.IMSD.CONTEL.COM> <1989Dec6.190415.19049@brutus.cs.uiuc.edu> <18511@bellcore.bellcore.com>
Sender: <daemon>
Organization: University of Toronto
Lines: 97

In article <18511@bellcore.bellcore.com> karn@jupiter.bellcore.com (Phil R. Karn) writes:
>... after someone else first writes:
>>The TCP checksum is an end-to-end checksum.  Saying
>>"it's a hardware problem" is a cop-out...
>
>I am a big believer in the end-to-end argument.
>

Many people appear to subscribe to "the end to end argument".
I don't know if they have been convinced by persuasive argument or by
the eminence of others who have endorsed it.  The arguments I have
heard do not convince me, and I believe that in many instances data
integrity can and should be a hardware problem.

Saltzer, Reed, and Clark argued the end-to-end case in their paper
in the November '84 TOCS.  They offer a file transfer application as
their paradigm, and note five threats to the integrity of the data stored
at the receiving host:
    1) Undetected disk or disk controller error on the sending
       host while reading the original file.
    2) Software error in either host or in communications system.
    3) Undetected processor or memory hardware error in either
       end system.
    4) Communications error.
    5) Host crashes during transfer.
They then claim that a hardware-implemented communications checksum only
addresses fault 4, and that
	"the careful file transfer application must still counter the
	 remaining threats; so it should still provide its own retries
	 based on an end-to-end checksum of the file."
Perhaps a careful file transfer application should, but to conclude by
extension that all applications must is unwarranted.
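
For concreteness, the "careful file transfer" being quoted can be sketched
roughly as below.  This is only an illustration: the names file_checksum
and careful_copy are made up, zlib's CRC-32 stands in for whatever
end-to-end checksum one prefers, and shutil.copyfile stands in for the
actual network transfer.

```python
import shutil
import zlib

def file_checksum(path):
    """CRC-32 over the file contents as read back from storage."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return crc

def careful_copy(src, dst, transfer=shutil.copyfile, max_tries=3):
    """Transfer src to dst, retrying until the end-to-end checksums agree.

    Returns the number of attempts used.  The checksum of src is
    recomputed here at send time, which is a simplification.
    """
    expected = file_checksum(src)
    for attempt in range(1, max_tries + 1):
        transfer(src, dst)
        if file_checksum(dst) == expected:
            return attempt
    raise IOError("checksums still differ after %d tries" % max_tries)
```

Note that recomputing the checksum at send time does nothing about
threat 1; a stricter variant would keep the checksum alongside the file
rather than derive it from the same read.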

In counter-argument, note that the first three risks are not unique to
communications systems, and will be faced by a careful single-host
disk-to-disk file copying program.  Does this mean that every disk block
on your computer system contains a software generated checksum? 
Undoubtedly there are applications that do this, but in most cases
the prevailing industry standard of reliability, enhanced as
it might be by embedded parity/EDC/ECC hardware, is sufficient.

The particular flavour of risk 5 encountered here is unique to distributed
systems, but its detection involves handshaking and does not require
"an end-to-end checksum of the file."

A related end-to-end argument, enunciated by Cheriton in his XXX
VMTP paper, states that only the application knows how small an error
probability is tolerable, so the application must implement the checksum.
This argument is incorrect, firstly because checksumming cannot establish
a maximum error probability; it can only reduce it.  An application may be
willing to tolerate a 10^-18 error rate, but it cannot determine how strong
an error detection scheme to use unless it also knows what error rate it is
starting with.  Secondly, different error detection methods have different
strengths and weaknesses, and an appropriate selection may require
knowledge of the nature of the most probable errors.
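
To put rough numbers on the first point, here is a small sketch.  It
assumes an idealized k-bit check that lets a corrupted block through with
probability 2^-k, which is only a reasonable approximation for errors that
effectively randomize the block; the function names and the raw error
rates are invented for illustration.

```python
import math

def residual_error_rate(raw_error_rate, check_bits):
    """Undetected-error rate left after an idealized check.

    Assumption: a corrupted block slips past the check with
    probability 2**-check_bits.  Real checks do better or worse
    depending on the error pattern.
    """
    return raw_error_rate * 2.0 ** -check_bits

def check_bits_needed(raw_error_rate, target_rate):
    """Smallest idealized check width that reaches target_rate."""
    if raw_error_rate <= target_rate:
        return 0
    return math.ceil(math.log2(raw_error_rate / target_rate))

# Same 10^-18 target, three made-up starting points.
for raw in (1e-4, 1e-9, 1e-14):
    print(raw, check_bits_needed(raw, 1e-18))
```

Under these assumptions the same 10^-18 target calls for 47, 30, or 14
check bits depending on the raw rate, so the target alone does not tell
the application how strong a check to use.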

In my view, data integrity can be effectively provided by chaining
together reliable links, and if this is done properly then no end-to-end
check is necessary.  "Doing it properly" means providing an
input-to-output error rate comparable to the one you get from your
favourite disk drive/controller combination, or tape/formatter combination.
That likely means some error detection (or maybe correction, for something
like a satellite channel) "on the wire", and maybe some parity or
something even fancier in the communications processor memory, and maybe
you need to ensure that the domains of protection of these things overlap
so that errors don't sneak in through the cracks.  Check out what
Ciprico/Xylogics/etc. do with the caches on their intelligent disk
controllers.  Provide the same level of service they do; it seems to
be adequate for most applications.  Who knows, you might even be able to
write applications that don't have to be told whether they are dealing with
disks or networks, because the same de facto reliability standard
applies to both.
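
As a back-of-envelope illustration of what chaining buys you, the sketch
below combines per-stage undetected-error rates into a rate for the whole
path.  It assumes the stages fail independently, and the stage names and
figures are hypothetical, not measurements of any real equipment.

```python
import math

def chain_error_rate(stage_rates):
    """Chance that at least one stage passes a corrupted block undetected.

    Assumes the stages fail independently.  log1p/expm1 keep precision
    when the individual rates are far below 1.
    """
    return -math.expm1(sum(math.log1p(-p) for p in stage_rates))

# Hypothetical per-block undetected-error rates, for illustration only.
stages = {
    "disk and controller": 1e-13,
    "host memory": 1e-14,
    "link with hardware CRC": 1e-13,
    "gateway memory": 1e-14,
}
print(chain_error_rate(stages.values()))
```

For rates this small the result is essentially the sum of the parts, so
a single unprotected stage would dominate the whole chain.  That is why
the domains of protection need to overlap.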

When does this approach fail?  Obviously, if your packets are going
over a network you know nothing about, or have no control over, it is not
safe to make assumptions about its error properties.  In that case,
higher-level error detection does make sense, but it should be provided
as an option, or perhaps as a gateway function, and it need not be
application-to-application.  Or, if you have a particularly finicky
or critical application that may not be satisfied with the 'standard'
reliability, it may wish to enhance it and should do so whether the
data transfer involves a network, a disk, a tape, or whatever.  Moreover,
it now has a fighting chance of doing what you want because it has at least
a vague idea of the kind of underlying error rates it has to deal with.
Finally, I will concede the possibility that under the right (unusual)
conditions it may be cheaper to build hardware with no error control and
rely entirely on end-to-end error detection in software, but I surmise
that the performance penalty associated with providing error detection of
any real strength in software (say, the equivalent of a 32-bit CRC) will render
this a rather unpopular option.
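
To give a feel for the software cost in question, here is a sketch of a
bit-at-a-time 32-bit CRC.  The polynomial (the reflected IEEE 802.3 one,
0xEDB88320) and the function name are chosen purely for illustration;
nothing above depends on that particular choice.

```python
import zlib

def crc32_bitwise(data):
    """Bit-at-a-time CRC-32 using the reflected IEEE 802.3 polynomial."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        # Eight shift-and-conditional-XOR steps for every byte of data.
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

# Cross-check against the library implementation of the same CRC.
assert crc32_bitwise(b"123456789") == zlib.crc32(b"123456789")
```

A table-driven version cuts the inner loop to one lookup per byte, but
every byte of the data still has to pass through the processor.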

Comments welcome.  Thanks for your attention.


-- 
		    Brian Thomson,	    CSRI Univ. of Toronto
		    utcsri!uthub!thomson, thomson@hub.toronto.edu