Path: utzoo!mnetor!uunet!husc6!cmcl2!nrl-cmf!ames!pasteur!ucbvax!UXC.CSO.UIUC.EDU!kline From: kline@UXC.CSO.UIUC.EDU (Charley Kline) Newsgroups: comp.protocols.tcp-ip Subject: Re: Checksums (again) Message-ID: <8804012153.AA04185@uxc.cso.uiuc.edu> Date: 1 Apr 88 21:53:44 GMT Sender: usenet@ucbvax.BERKELEY.EDU Organization: The Internet Lines: 61 When I first started the CTSS TCP/IP project, Dave Clark suggested that I publish the checksum algorithm. His contention was that most checksummers are dreadfully machine-dependent, incomprehensible, and artful things. Well, I didn't put much effort into art, but I thought I'd oblige him. What I do to compute checksums on a Cray is a vector operation in assembler of all things. I can do up to 512 bytes of packet at a time. The core of the algorithm looks as follows. A1 holds the address of the 512-byte block of memory to checksum (I'm leaving out many details having to do with short blocks, because they are ugly). EBM A0 A1 VL 64 use full vectors S1 <32 form 32-bit mask from the right. A2 32 V1 ,A0,1 load packet into V1 V2 S1&V1 Form right-hand 32-bits in V2. V3 V1>A2 Form left-hand 32-bits in V3. V1 V2+V3 Add the two together. A2 63 Prepare to collapse into a scalar. S1 0 S4 <16 Form 16-bit mask from the right. A4 16 CK$LOOP S2 V1,A2 A2 A2-1 A0 A2 S1 S1+S2 JAN CK$LOOP S2 S1&S4 Form right-hand 16-bits in S2 S1 S1>A4 Form left-hand 16-bits in S1 S1 S1+S2 S2 S1&S4 Form right-hand 16-bits in S2 S1 S1>A4 Form left-hand 16-bits in S1 S1 S1+S2 S1 #S1 Take one's complement CMR At this point, S1 contains the checksum. Note that this takes full advantage of the fact that the one's complement sum is an Abelian Group--I can add 32-bit pieces instead of 16-bit pieces, as has been mentioned. I can also take advantage of the fact that the sum of the checksums of several blocks is the same as the checksum of the sum by carrying out this checksum on several blocks and adding up the answers. First I load two copies of the packet into two vector registers. One gets vector-shifted right 32 bits; the other gets vector-ANDED with a 32 bit mask. Then the two are added together. Since all these operations chain, I can produce one result per clock cycle. Then I collapse the results in the vector by adding each element to a scalar register. Finally, the end-around carry and conversion to 16-bits. It's all terribly fast. Now if only context-switching and IPC under CTSS were as fast... ----- Charley Kline, University of Illinois Computing Services kline@uxc.cso.uiuc.edu {ihnp4,uunet,pur-ee,convex}!uiucuxc!kline "Birth. School. Work. Death."