Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!tut.cis.ohio-state.edu!zaphod.mps.ohio-state.edu!brutus.cs.uiuc.edu!ux1.cso.uiuc.edu!ux1.cso.uiuc.edu!aglew
From: aglew@oberon.csg.uiuc.edu (Andy Glew)
Newsgroups: comp.arch
Subject: Re: Question on division methods
Message-ID: <AGLEW.90Jun4144801@oberon.csg.uiuc.edu>
Date: 4 Jun 90 18:48:01 GMT
References: <26681c80.27e8@petunia.CalPoly.EDU>
Sender: usenet@ux1.cso.uiuc.edu (News)
Distribution: usa
Organization: University of Illinois, Computer Systems Group
Lines: 87
In-Reply-To: mdeale@sargas.acs.calpoly.edu's message of 2 Jun 90 20:07:28 GMT

To: mdeale@cosmos.acs.calpoly.edu
Subject: Shift Division

>   I believe that given the ever increasing frequencies for micro-
>processors, the division operation would do well by using a
>shift-and-subtract method vs. using a multiplier based technique.
>An extra shifter and subtractor would take up minimal chip area,
>and division could proceed *simultaneously* with multiplication.
>
>   On the other hand, the i860 can produce a result every instruction
>clock cycle. Newton's method could be accomplished in about 10 cycles;
>although I hear the actual figure is more like 21-22 cycles (to account
>for exponent processing?). But then the multiply unit is tied up for
>the division operation, which could be a real performance hit on
>certain applications.

As usual, "it depends".

In fact, at the moment most integer processors implement divide with a
shift-and-add loop. The exceptions are the 88K, which has an on-chip
multiplier, and the ones which borrow from the FPU chip.
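
For readers who have not seen the serial method written out, here is a
minimal sketch of unsigned restoring division, one quotient bit per
iteration. It is an illustration only, not any particular chip's
microcode; the function name and the `width` parameter are my own
assumptions.

```python
def shift_subtract_divide(dividend, divisor, width=32):
    """Unsigned restoring division: yields one quotient bit per step.

    Hypothetical helper for illustration; `width` is the assumed
    register width in bits.
    """
    if divisor <= 0:
        raise ZeroDivisionError("divisor must be a positive integer")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the given width")

    remainder = 0
    quotient = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Trial subtract: commit only if the result stays non-negative.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

The loop always runs `width` times, which is why the hardware is small
(a shifter, a subtractor, a compare) but the latency is roughly one
cycle per result bit.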

For a GaAs microprocessor with very limited device count, shift mechanisms
are probably better.

For a CMOS processor where, increasingly, the problem is "I can get 2 million
transistors on a chip[*], but the processor only takes 350,000. What is the best
thing to use the remaining transistors for?" the answer is not so clear.
[*] Note that 1.2 million is what ships now. Chips being designed now are a bit larger.
    What can the extra devices be used for?
    	- large on-chip caches
    	    	Probably for a while longer.
    	    	However, there is a fall-off point around 64K-256K bytes of cache.
    	- on-chip real memory
    	    	trouble is, we don't have enough space to provide really interesting
    	    	amounts of memory (top of the line workstations require ~16M now).
    	    	There's a market in embedded systems for just about every bit
    	    	of on-chip memory size, so these chips will be built.  And,
    	    	at some point, bottom-of-the-line workstations will take a sudden
    	    	drop in price by using on-chip main memory as embedded processors
    	    	cross over.
    	- add special functional units on-chip
    	    	A lot of companies are already doing this:
    	    	- parallel array multipliers
    	    	- floating point
    	    	- mmu
    	    	- graphics ops
    	    
    	    	However, there are only so many functional units that a general
    	    	purpose microprocessor can use. Once again, embedded processors
    	    	can use many more, so there will probably be growth in that 
    	    	market segment.  And then, in a few years, somebody will realize
    	    	that an embedded processor with HUD (Heads-up-display) controller
    	    	for output, a single serial line for keyboard, on-chip timer,
    	    	and 4M of RAM is a decent wristwatch computer...
    	    	    (Ethernet on-chip, unfortunately, is a good bit further
    	    	away. Sigh).

    	- parallelism
    	    	There are several types of on-chip parallelism that can be explored:
    	    	- superscalar
    	    	- liw
    	    	Both of the above are being explored pretty much already.
    	    	- multiple processors per chip
    	    	Certainly possible, but this doesn't help you with the single large job
    	    	(the big selling point), and, to build a really large high performance
    	    	system you have to solve the off-chip performance problem anyway.
    	    	    However, it is something worth considering, especially when
    	    	you can make tradeoffs like providing only one multiplier array
    	    	shared between two processors. Research topic?

From time to time I draw out graphs of where I think these trends are going,
and when those cut-overs from embedded to general purpose will occur.
I'd appreciate any data points for these graphs.

Anyway, about divides:
    Divide is one of those things that, if you don't do it well, will
get you excluded from certain markets.  My guess would be that if your
memory access time (cache *hit* time) is below, say, 16 cycles,
parallel divide using a multiplier will win if you can fit it in.
Otherwise, serial divide; and serial divide when you can't fit a
multiplier in.
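
For comparison, a minimal sketch of the multiplier-based approach:
Newton's iteration for the reciprocal, x <- x * (2 - d * x), followed
by one multiply by the numerator. This is not the i860's actual
sequence, which I have not reproduced here. The function name, the
iteration count, and the linear seed 48/17 - (32/17)*m are assumptions
chosen for the illustration; real hardware would more likely seed from
a small lookup table.

```python
import math


def newton_divide(n, d, iterations=4):
    """Approximate n / d using only multiplies and subtracts.

    Hypothetical helper for illustration. The divisor is scaled to
    m in [0.5, 1) so that a fixed linear seed is good enough.
    """
    if d == 0:
        raise ZeroDivisionError("division by zero")

    m, e = math.frexp(abs(d))        # abs(d) == m * 2**e, 0.5 <= m < 1
    x = 48.0 / 17.0 - (32.0 / 17.0) * m   # assumed seed for 1/m

    for _ in range(iterations):
        # Two dependent multiplies per step; the relative error in x
        # is squared each time (quadratic convergence).
        x = x * (2.0 - m * x)

    reciprocal = math.ldexp(x, -e)   # undo the scaling: 1/|d|
    q = n * reciprocal
    return -q if d < 0 else q
```

Each pass needs two multiplies that depend on each other, so the
multiplier is busy for the whole divide. That is the cost the original
question was worried about.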


By the way, this is what comp.arch should be talking about. Hence my posting.
--
Andy Glew, aglew@uiuc.edu