Path: utzoo!attcan!utgpu!news-server.csri.toronto.edu!rutgers!uwm.edu!ux1.cso.uiuc.edu!ux1.cso.uiuc.edu!aglew
From: aglew@crhc.uiuc.edu (Andy Glew)
Newsgroups: comp.arch
Subject: Re: int x int -> long for * (or is it 32x32->64)
Message-ID: <AGLEW.90Sep16115215@dwarfs.crhc.uiuc.edu>
Date: 16 Sep 90 16:52:15 GMT
References: <3984@bingvaxu.cc.binghamton.edu> <41425@mips.mips.COM>
	<4025@bingvaxu.cc.binghamton.edu> <69436@sgi.sgi.com>
Sender: news@ux1.cso.uiuc.edu (News)
Organization: Center for Reliable and High-Performance Computing University of
	Illinois at Urbana Champaign
Lines: 40
In-Reply-To: vjs@rhyolite.wpd.sgi.com's message of 15 Sep 90 05:39:46 GMT

In article <4025@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> I'm afraid the hard-hearted answer has got to be along the lines of:
> "What proportion of time is spent in the checksum code, & by what factor
> is it increased by not having the convenient carry"?

On a 68030-based system I found that 15% of the entire CPU was being
spent in in_chksum() (my memory may be failing wrt. the exact name), on
real user workloads (namely, the backbone machines on our local net).
This was for the naive byte-at-a-time one's complement sum (again,
memory may be failing me: I believe it was a one's complement sum,
but there have been quite a variety of checksums).

Unrolling the loop and computing the checksum 32 bits at a time
instead of 8 bits at a time gave me roughly a 6- to 8-fold speedup.
A bit of instrumentation showed that the overwhelming majority of
packets were of only two sizes, so those two sizes were special-cased.
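
A minimal sketch in C of the byte-at-a-time versus 32-bits-at-a-time
idea, for illustration only. This is not the 68030 code described
above: the names cksum_bytewise, cksum_wordwise, fold16, load_be32 and
sum_bytes are hypothetical, the two-size special casing is omitted,
and a 64-bit accumulator stands in for the hardware carry so the
example stays portable. Both functions return the uncomplemented
16-bit one's complement sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a wide accumulator to 16 bits, adding carries back in (end-around). */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Assemble a big-endian 32-bit word from four bytes. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Add bytes [i, len) one at a time as halves of big-endian 16-bit words. */
static uint64_t sum_bytes(const uint8_t *p, size_t i, size_t len, uint64_t sum)
{
    for (; i < len; i++)
        sum += (i & 1) ? (uint64_t)p[i] : (uint64_t)p[i] << 8;
    return sum;
}

/* Naive version: one byte per loop iteration. */
uint16_t cksum_bytewise(const uint8_t *p, size_t len)
{
    assert(p != NULL || len == 0);
    return fold16(sum_bytes(p, 0, len, 0));
}

/* Wider version: 32 bits per add, loop unrolled by two.
 * Because 2^16 == 1 (mod 2^16 - 1), summing 32-bit words and folding
 * gives the same result as summing 16-bit words. */
uint16_t cksum_wordwise(const uint8_t *p, size_t len)
{
    uint64_t sum = 0;
    size_t i = 0;

    assert(p != NULL || len == 0);
    for (; i + 8 <= len; i += 8) {
        sum += load_be32(p + i);
        sum += load_be32(p + i + 4);
    }
    /* Up to 7 leftover bytes; i is a multiple of 8, so byte parity holds. */
    return fold16(sum_bytes(p, i, len, sum));
}
```

The wide loop does one add per four bytes instead of one add (plus a
shift or test) per byte, which is where most of the gain would come
from; how close a given machine gets to the 6- to 8-fold figure
depends on load alignment and loop overhead.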

With the new code, in_chksum() was reduced to around 4% of the CPU.
(Not linearly divided by the speedup, because of overhead and a traffic
increase.)  I used the carry-out and carry-in to do this. Coding
without the carry would approximately double the number of instructions
for this checksum, and many of the added instructions would be
branches.  Actually, I'd probably only do it 16 bits at a time, with no
branches, which would be about a 3-fold slowdown (shifts and masks
in the loop).
    I.e., based on my experience coding in_chksum(), but not having
coded it on a MIPS, I would estimate that the slowdown from not
having carry-out and carry-in is approximately 3-fold wrt. good code
that uses them. But this is only an upper bound, because
overhead of call, etc., gets in the way.
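
To make the comparison concrete, here is a hedged C sketch of the two
inner loops being compared. It is an illustration, not measured code:
C has no carry flag, so add_eac32 emulates the add-with-carry step
with a compare, and the names add_eac32, fold32, sum_with_carry and
sum_16bit_nocarry are hypothetical. Input is taken as an array of
32-bit words for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit add with end-around carry. On a machine with a carry flag
 * this is an add followed by an add-with-carry; here the carry-out
 * is recovered with an unsigned compare. */
static uint32_t add_eac32(uint32_t sum, uint32_t w)
{
    sum += w;
    return sum + (sum < w);   /* carry-out fed back in as carry-in */
}

/* Fold 32 bits to 16 with end-around carry; two passes always suffice. */
static uint16_t fold32(uint32_t acc)
{
    acc = (acc & 0xffff) + (acc >> 16);
    acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)acc;
}

/* "Good code that uses carry": one add-with-carry per 32-bit word. */
uint16_t sum_with_carry(const uint32_t *w, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = add_eac32(acc, w[i]);
    return fold32(acc);
}

/* No carry, no branches: split each word into 16-bit halves with a
 * shift and a mask, and let carries pile up in the top of acc. */
uint16_t sum_16bit_nocarry(const uint32_t *w, size_t n)
{
    uint32_t acc = 0;
    assert(n <= 32768);       /* beyond this acc itself could overflow */
    for (size_t i = 0; i < n; i++) {
        acc += w[i] >> 16;
        acc += w[i] & 0xffff;
    }
    return fold32(acc);
}
```

Per 32-bit word the first loop needs roughly a load and an add with
carry, while the second needs a load, a shift, a mask and two adds,
which is consistent with an estimate in the 2- to 3-fold range for the
inner loop alone.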

I do not wish to pass judgement on the usefulness of carries; I only
wished to provide a data point for "by what factor is it [the
checksum] increased by not having the convenient carry".


--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]