Path: utzoo!mnetor!uunet!husc6!cmcl2!nrl-cmf!ames!pasteur!ucbvax!UXC.CSO.UIUC.EDU!kline
From: kline@UXC.CSO.UIUC.EDU (Charley Kline)
Newsgroups: comp.protocols.tcp-ip
Subject: Re:  Checksums (again)
Message-ID: <8804012153.AA04185@uxc.cso.uiuc.edu>
Date: 1 Apr 88 21:53:44 GMT
Sender: usenet@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 61

When I first started the CTSS TCP/IP project, Dave Clark suggested that
I publish the checksum algorithm. His contention was that most checksummers
are dreadfully machine-dependent, incomprehensible, and artful things. Well,
I didn't put much effort into art, but I thought I'd oblige him.

What I do to compute checksums on a Cray is a vector operation
in assembler of all things. I can do up to 512 bytes of packet at a
time. The core of the algorithm looks as follows. A1 holds the address
of the 512-byte block of memory to checksum (I'm leaving out many
details having to do with short blocks, because they are ugly).

	EBM
	A0	A1
	VL	64		use full vectors
	S1	<32		form 32-bit mask from the right.
	A2	32
	V1	,A0,1		load packet into V1
	V2	S1&V1		Form right-hand 32-bits in V2.
	V3	V1>A2		Form left-hand 32-bits in V3.
	V1	V2+V3		Add the two together.
	A2	63		Prepare to collapse into a scalar.
	S1	0
	S4	<16		Form 16-bit mask from the right.
	A4	16
CK$LOOP S2	V1,A2
	A2	A2-1
	A0	A2
	S1	S1+S2
	JAN	CK$LOOP
	S2	S1&S4		Form right-hand 16-bits in S2
	S1	S1>A4		Form left-hand 16-bits in S1
	S1	S1+S2
	S2	S1&S4		Form right-hand 16-bits in S2
	S1	S1>A4		Form left-hand 16-bits in S1
	S1	S1+S2
	S1	#S1		Take one's complement
	CMR			At this point, S1 contains the checksum.


Note that this takes full advantage of the fact that the one's complement
sum is an Abelian Group--I can add 32-bit pieces instead of 16-bit pieces,
as has been mentioned. I can also take advantage of the fact that the sum of
the checksums of several blocks is the same as the checksum of the sum
by carrying out this checksum on several blocks and adding up the answers.

First I load two copies of the packet into two vector registers. One
gets vector-shifted right 32 bits; the other gets vector-ANDED with a
32 bit mask. Then the two are added together. Since all these operations
chain, I can produce one result per clock cycle. Then I collapse
the results in the vector by adding each element to a scalar register.
Finally, the end-around carry and conversion to 16-bits. It's all
terribly fast. Now if only context-switching and IPC under CTSS were
as fast...

-----
Charley Kline,  University of Illinois Computing Services
kline@uxc.cso.uiuc.edu
{ihnp4,uunet,pur-ee,convex}!uiucuxc!kline

"Birth.   School.   Work.   Death."