Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!tut.cis.ohio-state.edu!zaphod.mps.ohio-state.edu!brutus.cs.uiuc.edu!ux1.cso.uiuc.edu!ux1.cso.uiuc.edu!aglew
From: aglew@oberon.csg.uiuc.edu (Andy Glew)
Newsgroups: comp.arch
Subject: Re: Question on division methods
Message-ID:
Date: 4 Jun 90 18:48:01 GMT
References: <26681c80.27e8@petunia.CalPoly.EDU>
Sender: usenet@ux1.cso.uiuc.edu (News)
Distribution: usa
Organization: University of Illinois, Computer Systems Group
Lines: 87
In-Reply-To: mdeale@sargas.acs.calpoly.edu's message of 2 Jun 90 20:07:28 GMT

To: mdeale@cosmos.acs.calpoly.edu
Subject: Shift Division

> I believe that given the ever increasing frequencies for micro-
>processors, the division operation would do well by using a
>shift-and-subtract method vs. using a multiplier based technique.
>An extra shifter and subtractor would take up minimal chip area,
>and division could proceed *simultaneously* with multiplication.
>
> On the other hand, the i860 can produce a result every instruction
>clock cycle. Newton's method could be accomplished in about 10 cycles;
>although I hear the actual figure is more like 21-22 cycles (to account
>for exponent processing?). But then the multiply unit is tied up for
>the division operation, which could be a real performance hit on
>certain applications.

As usual, "it depends".

In fact, at this moment most integer processors implement divide with
a shift and add, except for the 88K which has an on-chip multiplier,
and the ones which borrow from the FPU chip.

For a GaAs microprocessor with very limited device count, shift
mechanisms are probably better.

For a CMOS processor where, increasingly, the problem is "I can get 2
million transistors on a chip[*], but the processor only takes
350,000. What is the best thing to use the remaining transistors
for?" the answer is not so clear.

[*] Note that 1.2 million is now. Chips being designed now are a bit
    larger.

What can the extra devices be used for?

    - large on-chip caches
        probably for a while longer. however, there is a fall-off
        point around 64K-256K bytes of cache.

    - on-chip real memory
        trouble is, we don't have enough space to provide really
        interesting amounts of memory (top of the line workstations
        require ~16M now).
        There's a market in embedded systems for just about every
        bit of on-chip memory size, so these chips will be built.
        And, at some point, bottom-of-the-line workstations will
        take a sudden drop in price by using on-chip main memory as
        embedded processors cross over.

    - add special functional units on-chip
        A lot of companies are already doing this:
            - parallel array multipliers
            - floating point
            - mmu
            - graphics ops
        However, there are only so many functional units that a
        general purpose microprocessor can use. Once again, embedded
        processors can use many more, so there will probably be
        growth in that market segment. And then, in a few years,
        somebody will realize that an embedded processor with HUD
        (Heads-up-display) controller for output, a single serial
        line for keyboard, on-chip timer, and 4M of RAM is a decent
        wristwatch computer...
        (Ethernet on-chip, unfortunately, is a good bit further
        away. Sigh).

    - parallelism
        There are several types of on-chip parallelism that can be
        explored:
            - superscalar
            - liw
                Both of the above are being explored pretty much
                already.
            - multiple processors per chip
                Certainly possible, but this doesn't help you with
                the single large job (the big selling point), and,
                to build a really large high performance system you
                have to solve the off-chip performance problem
                anyway. However, it is something worth considering,
                especially when you can make tradeoffs like
                providing only one multiplier array shared between
                two processors. Research topic?

From time to time I draw out graphs of where I think these trends are
going, and when those cut-overs from embedded to general purpose will
occur. I'd appreciate any data points for these graphs.

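For readers who want the two divide styles from the quoted question
made concrete, here is a minimal C sketch. It is an illustration only,
not how the i860, the 88K, or any other chip actually does it. The
function names, the 32-bit width, the use of doubles, and the linear
starting estimate for the reciprocal are all assumptions of mine. The
first routine produces one quotient bit per step (the serial,
shift-and-subtract style); the second uses only multiplies and
subtracts, which is why it ties up the multiplier.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: restoring shift-and-subtract divide.
   One quotient bit per iteration, 32 iterations for 32 bits,
   which is the software analogue of a serial hardware divider. */
uint32_t shift_sub_divide(uint32_t dividend, uint32_t divisor,
                          uint32_t *remainder)
{
    assert(divisor != 0);
    uint64_t rem = 0;    /* partial remainder; 64 bits so the shift cannot overflow */
    uint32_t quot = 0;
    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((dividend >> i) & 1u);  /* bring down next bit */
        quot <<= 1;
        if (rem >= divisor) {   /* trial subtract succeeds: quotient bit is 1 */
            rem -= divisor;
            quot |= 1u;
        }
    }
    if (remainder)
        *remainder = (uint32_t)rem;
    return quot;
}

/* Hypothetical helper: Newton-Raphson reciprocal, x <- x * (2 - d*x).
   Assumes d has already been normalized into [0.5, 1), as a mantissa
   would be; exponent handling is left out. Each iteration costs two
   multiplies and roughly doubles the number of correct bits. */
double newton_reciprocal(double d, int iterations)
{
    assert(d >= 0.5 && d < 1.0);
    /* Assumed seed: a common linear estimate for d in [0.5, 1). */
    double x = 48.0 / 17.0 - (32.0 / 17.0) * d;
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - d * x);
    return x;
}
```

A quotient a/d is then a * newton_reciprocal(d, n), at the price of one
more multiply. The serial routine needs a fixed number of cheap steps
equal to the word width, while the Newton version needs only a handful
of iterations but each one occupies the multiplier.
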
Anyway, about divides:

Divide is one of those things that, if you don't do it well, you will
be excluded from certain markets.

My guess would be that if your memory access time (cache *hit* time)
is below, say, 16 cycles, parallel divide using a multiplier will win
if you can fit it in. Otherwise, serial divide; and serial divide
when you can't fit a multiplier in.

By the way, this is what comp.arch should be talking about. Hence my
posting.

--
Andy Glew, aglew@uiuc.edu