Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!daemon
From: thomson@hub.toronto.edu (Brian Thomson)
Newsgroups: comp.protocols.tcp-ip
Subject: Re: more on Fletcher
Message-ID: <8912071916.AA23996@beaches.hub.toronto.edu>
Date: 7 Dec 89 19:18:33 GMT
References: <8912060542.AA05854@WLV.IMSD.CONTEL.COM> <1989Dec6.190415.19049@brutus.cs.uiuc.edu> <18511@bellcore.bellcore.com>
Sender:
Organization: University of Toronto
Lines: 97

In article <18511@bellcore.bellcore.com> karn@jupiter.bellcore.com (Phil R. Karn) writes:
>...
after someone else first writes:
>>The TCP checksum is an end-to-end checksum.  Saying
>>"it's a hardware problem" is a cop-out...
>
>I am a big believer in the end-to-end argument.
>

Many people appear to subscribe to "the end to end argument".  I don't
know if they have been convinced by persuasive argument or by the
eminence of others who have endorsed it.  The arguments I have heard do
not convince me, and I believe that in many instances data integrity can
and should be a hardware problem.

Saltzer, Reed, and Clark argued the end-to-end case in their paper in
the November '84 TOCS.  They offer a file transfer application as their
paradigm, and note five threats to the integrity of the data stored at
the receiving host:

    1) Undetected disk or disk controller error on the sending host
       while reading the original file.
    2) Software error in either host or in communications system.
    3) Undetected processor or memory hardware error in either end
       system.
    4) Communications error.
    5) Host crashes during transfer.

They then claim that a hardware-implemented communications checksum only
addresses fault 4, and that "the careful file transfer application must
still counter the remaining threats; so it should still provide its own
retries based on an end-to-end checksum of the file."  Perhaps a careful
file transfer application should, but to conclude by extension that all
applications must is unwarranted.

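For concreteness, the kind of "careful file transfer" being described
might look roughly like the sketch below.  This is only an illustration
under stated assumptions, not anything taken from the paper: SHA-256
stands in for whatever end-to-end checksum is chosen, and the names
file_digest, careful_copy, and the transfer argument are hypothetical.

```python
import hashlib
import shutil


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def careful_copy(src, dst, transfer=shutil.copyfile, max_tries=3):
    """Move src to dst, verify by re-reading dst, retry on mismatch.

    `transfer` is a hypothetical stand-in for whatever moves the bytes
    (a network transfer, a local copy, ...).  Returns the number of
    attempts that were needed.
    """
    expected = file_digest(src)
    for attempt in range(1, max_tries + 1):
        transfer(src, dst)
        # End-to-end check: compare what landed on the far side with
        # what was read at the source.
        if file_digest(dst) == expected:
            return attempt
    raise OSError(f"checksum mismatch after {max_tries} tries")
```

Note that this sketch computes the expected value by reading the source
file, so it cannot catch a bad read at the source (threat 1) unless the
expected checksum was recorded at some earlier time.
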
In counter-argument, note that the first three risks are not unique to
communications systems, and will be faced by a careful single-host
disk-to-disk file copying program.  Does this mean that every disk block
on your computer system contains a software generated checksum?
Undoubtedly there are applications that do this, but in most cases the
prevailing industry standard of reliability, enhanced as it might be by
embedded parity/EDC/ECC hardware, is sufficient.  The particular flavour
of risk 5 encountered here is unique to distributed systems, but its
detection involves handshaking and does not require "an end-to-end
checksum of the file."

A related end-to-end argument, enunciated by Cheriton in his XXX VMTP
paper, states that only the application knows how small an error
probability is tolerable, so the application must implement the
checksum.  This argument is incorrect, firstly because checksumming
cannot establish a maximum error probability; it can only reduce it.  An
application may be willing to tolerate a 10^-18 error rate, but it cannot
determine how strong an error detection scheme to use unless it also
knows what error rate it is starting with.  Secondly, different error
detection methods have different strengths and weaknesses, and an
appropriate selection may require knowledge of the nature of the most
probable errors.

In my view, data integrity can be effectively provided by chaining
together reliable links, and if this is done properly then no end-to-end
check is necessary.  "Doing it properly" means providing an
input-to-output error rate comparable to what you get from your
favourite disk drive/controller combination, or tape/formatter
combination.
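
To put rough numbers on the two points above (a check only scales down
the error rate it starts with, and chained links accumulate their
individual rates), here is a back-of-the-envelope sketch.  It rests on
assumptions that are mine for illustration: a corrupted unit slips past
a k-bit check with probability about 2^-k, and the links fail
independently.  Real checks do better or worse than 2^-k depending on
the error pattern.  The function names are hypothetical.

```python
import math


def residual_error_rate(raw_rate, check_bits):
    """Undetected-error rate left after a check, assuming a corrupted
    unit slips past a k-bit check with probability about 2**-k."""
    return raw_rate * 2.0 ** -check_bits


def check_bits_needed(raw_rate, target_rate):
    """Smallest k with raw_rate * 2**-k <= target_rate (same assumption)."""
    if raw_rate <= target_rate:
        return 0
    return math.ceil(math.log2(raw_rate / target_rate))


def chained_error_rate(link_rates):
    """Probability that at least one of several independent links
    lets an error through: 1 - prod(1 - p_i)."""
    ok = 1.0
    for p in link_rates:
        ok *= 1.0 - p
    return 1.0 - ok
```

Under these assumptions, check_bits_needed(1e-6, 1e-18) gives 40 while
check_bits_needed(1e-12, 1e-18) gives 20: the same 10^-18 target calls
for a very different check depending on the underlying rate, which the
application cannot know by itself.  For small rates, chained_error_rate
is close to the sum of the per-link rates.
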
That likely means some error detection (or maybe correction, for
something like a satellite channel) "on the wire", and maybe some parity
or something even fancier in the communications processor memory, and
maybe you need to ensure that the domains of protection of these things
overlap so that errors don't sneak in through the cracks.  Check out
what Ciprico/Xylogics/etc. do with the caches on their intelligent disk
controllers.  Provide the same level of service they do; it seems to be
adequate for most applications.  Who knows, you might even be able to
write applications that don't have to be told whether they are dealing
with disks or networks, because the same de facto reliability standard
applies to both.

When does this approach fail?  Obviously, if your packets are going over
a network you know nothing about, or have no control over, it is not
safe to make assumptions about its error properties.  In that case,
higher-level error detection does make sense, but it should be provided
as an option, or perhaps as a gateway function, and it need not be
application-to-application.  Or, if you have a particularly finicky or
critical application that may not be satisfied with the 'standard'
reliability, it may wish to enhance it and should do so whether the data
transfer involves a network, a disk, a tape, or whatever.  Moreover, it
now has a fighting chance of doing what you want because it has at least
a vague idea of the kind of underlying error rates it has to deal with.

Finally, I will concede the possibility that under the right (unusual)
conditions it may be cheaper to build hardware with no error control and
rely entirely on end-to-end error detection in software, but I surmise
that the performance penalty associated with providing error detection
of any capability in software (say, the equivalent of a 32-bit CRC) will
render this a rather unpopular option.

Comments welcome.  Thanks for your attention.
--
Brian Thomson, CSRI Univ. of Toronto
utcsri!uthub!thomson, thomson@hub.toronto.edu