Newsgroups: comp.arch
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!sarah!bingnews!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Subject: Re: RISC integer multiply/divide (was Re: Snake)
Message-ID: <1991Apr6.170236.2628@bingvaxu.cc.binghamton.edu>
Organization: State University of New York at Binghamton
References: <9538@mentor.cc.purdue.edu> <1991Apr4.213550.8106@bingvaxu.cc.binghamton.edu> <3275@charon.cwi.nl>
Date: Sat, 6 Apr 1991 17:02:36 GMT

In article <3275@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <1991Apr4.213550.8106@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
[I didn't understand the need for 17x17->34 arithmetic do do the 3 multiply 
trick]
>To understand it, see below.

Well I'm still not convinced. In the followup article I _think_ I have a
working 32x32->64 multiply (not that I've checked it exhaustively mind
you). Both the naive and 3-multiply versions seem to get the same
answers over a large number of cases (and I did some independent checking 
of the ``end conditions'' of all simple patterns of n-bits rotated in a 32-bit 
field  and those pretty pattern suggested by Knuth that are easy to eyeball). 
I don't see anywhere that uses 33x33->66 bit arithmetic although a bit of 
sign-testing goes on before we do the product-of-differences stuff.

Forgive me if I'm wrong, but to convert the unsigned routines to
signed ones just requires a couple of further 32-bit adds; still no
funny multiplies.

[I talk about multiply-and-add and shift-and-add perhaps having an
impact on 3-m vs 4-m; esp on DSP chips such as TI 320, _maybe_ on 6000]
>TI 320, yes, RS6000, no.  As I mailed you, this came from a remark from me
>about the m88k.  As on that machine with proper code a multiply takes only
>a single cycle, there is no way to improve it with 3 multiplies only.  

Well I don't know what you _think_ you said in your email. But all I
have on file is ``all those halfword adds are not worth the pain'';
I don't see anything there about m88k's (although that may have been
implicit, it doesn't read that way).

>My findings are that using an order n squared long integer multiply still pays
>for fairly large numbers (hundreds of decimal digits on current architectures).

Well on any machine where multiplies < about 6.5 average instructions
I would agree with you (see my previous post for derivation). In the
context of other architectures, which are abundant but not common :-)
maybe this is not the case. As all on comp.arch know by now, I'm not one for
taking people's word for anything when I can measure it for myself.

[long analysis of poor programs]
>Anyhow, your implementation is worthless, even if we expunge the glaring errors.

Sorry, but I thought it was apparent they were meant to be illustrative, 
not implementations. It's pretty hard to make C do a real 16x16->32 multiply 
in the first place :-)

As illustrations I guess they leave something to be desired due to the
various ambiguities you outlined, but I hope some of this was cleared up
by the follow-up post of what (I hope) are working versions.

-kym