Path: utzoo!utgpu!watserv1!watmath!att!pacbell!pacbell.com!ucsd!usc!snorkelwacker!husc6!husc9!cleary
From: cleary@husc9.harvard.edu (Kenneth Cleary)
Newsgroups: comp.arch
Subject: What about hardware implementations of optimized algorithms?
Summary: As opposed to optimized circuits.
Message-ID: <4047@husc6.harvard.edu>
Date: 31 Aug 90 03:12:38 GMT
Sender: news@husc6.harvard.edu
Reply-To: cleary@husc9.UUCP (Kenneth Cleary)
Followup-To: comp.arch
Organization: Harvard University Science Center   Cambridge, MA
Lines: 46

While I find discussions of hardware optimizations achieved through
improvement of the electronic circuitry (such as GaAs, and higher-density
VLSI) and design improvements like pipelining or instruction caching, I
would like to hear about examples of optimized algorithms implemented as
digital circuits.

It is not difficult to find a few books on optimized algorithms implemented
as software, but I'm having difficulty finding out about hardware aspects.
(I'm not specifically referring to digital circuit optimizations like gate-
count reduction, though that also interests me.)  Most hardware optimizat-
ions I've seen assume a fixed algorithm which is to be implemented as
efficiently as possible, rather than looking at desired input and desired
output, and attempting to find the most efficient algorithm.

SOFTWARE CASE IN POINT:
	Donald E. Knuth, The Art of Computer Programming, 
	vol.2, Seminumerical Algorithms
	page 278, section 4.3.3. How Fast Can We Multiply?
Knuth talks about getting faster multiplication than order-n**2 behavior.
He describes some methods of getting behavior of n**(lg3) or n**1.585
but this only starts to provide a signifigant boost as the size of the
numbers being multiplied becomes quite large.  (i.e. it might not be worth
it for 8-bit & 16-bit numbers, but may be competitive for 32-bit & 64-bit.)

	HEAVY PARAPHRASING OF KNUTH BEGINS...
If we have two n-bit [let's say 32-bit] numbers {underscores for subscripts}
u=(u_31,u_30,...,u_1,u_0) and v=(v_31,v_30,...,v_1,v_0)  {individual bits}
we can write       u=(2**16)U_1 + U_0        v=(2**16)V_1 + V_0
basically breaking each number into its most-sig half & least-sig half, so
        uv= ((2**32)+(2**16))(U_1)(V_1) + 
            (2**16)((U_1)-(U_0))((V_0)-(V_1))+((2**16)+1)(U_0)(V_0)
"This formula reduces the problem of multiplying [32-bit] numbers to three
multiplications of [16-bit] numbers... plus some simple shifting and adding
operations."
This breaking up into two parts is done recursively, breaking the number
into smaller and smaller chunks. {reminds me of quicksort}
	PARAPHRASING ENDS. {My apologies for typos}

So, while there may not have been demand for this sort of approach with
smaller 8-bit chunks, it may provide so much of a boost at 64-bits, as to
make it more even more worthwhile to build 64-bit machines for handling
64-bit numbers, instead of chopping them into two 32-bit chunks to be
handled by a 32-bit processor.
(wait a minute... "breaking into two chunks" sounds familiar...)

Oh Well!  It's late, and I think I should stop typing right n.............