Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!mnetor!uunet!husc6!cmcl2!rutgers!sunybcs!bingvaxu!leah!itsgw!batcomputer!pyramid!voder!apple!baum From: baum@apple.UUCP (Allen J. Baum) Newsgroups: comp.arch Subject: Re: Divides (Was RE: What should be in hardware but isn't) Message-ID: <6450@apple.UUCP> Date: Fri, 9-Oct-87 20:12:51 EDT Article-I.D.: apple.6450 Posted: Fri Oct 9 20:12:51 1987 Date-Received: Mon, 12-Oct-87 01:18:31 EDT References: <581@l.cc.purdue.edu> <18336@amdcad.AMD.COM> Reply-To: baum@apple.UUCP (Allen Baum) Organization: Apple Computer, Inc. Lines: 86 -------- [] This had turned into a really interesting argument, so here is my contribution to prolong it. >[ paraphrased from ( Dennis O'Connor's ) earlier article ] *A Newton-Rapheson interation divider needs: *A fast multiplier *It has the look-up table in random logic ( only need a few bits ). *Has a clever normalize-denominator instruction (needed). *Does the iteration in assembly language (it's a RISC machine) *so no additional datapath is required. And will be in silicon RSN. The obvious ways to do multi-bit/cycle integer divide require that at least one of the operands be normalized (like floating point- MSB=1). I believe that N-R also requires the divisor to be normalized, which is what the normalizer/ denormalizer is used for (as well as floating point in software). The N-R algorithm doubles the number of quotient bits on each iteration. To get an initial seed of N bits, you need a ROM (or gate equivalent) of 2^(N+2) x N. If you want to do this with gates, N is going to be less than 8. Suppose it is 4. The, on the first iteration, you need a 32 x 4 multiply, followed by 32 x 8, followed by 32x16. You may need a final 32x32 multiply to be accurate to the last bit; I can't remember the details. This looks pretty good, but I have to expound Baum's approximation (or lemma, or rule, or something). You can't build a multiplier with a multiplier (as opposed to multiplicand) of more than 8 bits that runs faster than a full adder in the same technology (and this is a carry-save multiplier- obviously, the full add at the end takes a full-add time). And, your cycle time should not be much longer than a full-add time; if it is longer, you have an extremely inefficient design (I'm willing to make 'much longer' be up to a factor of two-- I'm easy), since all the common operations run slower. Even a 'fast' multiply is a multi-cycle operation with these rules. You can, of course, pipeline such a multiplier so you can start a new multiply every cycle. You might even be able to build a full Wallace tree, but the 'fast' multiplier is likely to exceed the size of the rest of the processor. Note that all these arguments apply only to general purpose processors. Using these guidelines (which certainly could be questioned, but...), I think the latter stages of the divide iteration will be multi-cycle operations, or that the cycles will be much longer than they would be otherwise. This is a little generous, in that I assume that the early iterations are faster than the later ones, since they need to multiply fewer bits. This takes a bit of control logic to implement, and possibly some input multiplexing to the multipler. On the other hand, each multiply iteration does need a full carry-propagate adder at its end, so the divide step must be a lot longer than a simple add. By the same kinds of reasoning, I don't think that a normalize is a one cycle operation (for the kinds of short cycles you OUGHT to have). Not only do you essentially need a 'count leading zeroes' function, but the output of this has to feed a shifter. It can be done as a five step process (are the first 16 bits=0? If so <<16. Now, are the first 8bits=0? .....). This is getting perilously close to full-add speeds. Of course, I suppose you might have a slow adder.... In an earlier posting, the statement was made that replacing a microcode sequence with assembly language increases the number of instructions that must be fetched. I suppose that if you have microcode that is sufficiently tailored to a particular problem (i.e. emulating some assembly language), then it would probably be shorter. For general sorts of operations (like Loading, Storing, Branching, the kinds of things computers generally do) this is not the case. If a fixed microcode sequence is used to emulate some assembly language which implements some operation, than the microcode sequence will likely take longer. If, in fact, your microcode sequences are always shorter than assembly language, then you should do all your programming in microcode and call it assembly language. Just like the RISCs do. Finally, I agree that microcode doesn't need to be in supervisor state to turn off interrupts (as opposed to most assembly language level machines), so microcode doesn't lose you anything. But the question is: what does microcode GAIN you? I will, of course, concede that there are places outside an operating system kernel that need to do it. I don't know that I would concede that there are enough places outside a kernel that need it badly enough to make any performance difference, and that's the bottom line. -- {decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385