Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!cmcl2!rutgers!sunybcs!bingvaxu!leah!itsgw!batcomputer!pyramid!voder!apple!baum
From: baum@apple.UUCP (Allen J. Baum)
Newsgroups: comp.arch
Subject: Re: Divides (Was RE: What should be in hardware but isn't)
Message-ID: <6450@apple.UUCP>
Date: Fri, 9-Oct-87 20:12:51 EDT
Article-I.D.: apple.6450
Posted: Fri Oct  9 20:12:51 1987
Date-Received: Mon, 12-Oct-87 01:18:31 EDT
References: <581@l.cc.purdue.edu> <18336@amdcad.AMD.COM>
Reply-To: baum@apple.UUCP (Allen Baum)
Organization: Apple Computer, Inc.
Lines: 86

--------
[]
This had turned into a really interesting argument, so here is my contribution
to prolong it.

>[ paraphrased from ( Dennis O'Connor's ) earlier article ]
*A Newton-Rapheson interation divider needs:
*A fast multiplier
*It has the look-up table in random logic ( only need a few bits ).
*Has a clever normalize-denominator instruction (needed).
*Does the iteration in assembly language (it's a RISC machine)
*so no additional datapath is required. And will be in silicon RSN.

The obvious ways to do multi-bit/cycle integer divide require that at
least one of the operands be normalized (like floating point- MSB=1).
I believe that N-R also requires the divisor to be normalized, which
is what the normalizer/ denormalizer is used for (as well as floating
point in software).

The N-R algorithm doubles the number of quotient bits on
each iteration. To get an initial seed of N bits, you need a ROM (or
gate equivalent) of 2^(N+2) x N. If you want to do this with gates, N is
going to be less than 8. Suppose it is 4. The, on the first iteration,
you need a 32 x 4 multiply, followed by 32 x 8, followed by 32x16. You
may need a final 32x32 multiply to be accurate to the last bit; I can't
remember the details.

This looks pretty good, but I have to expound Baum's approximation
(or lemma, or rule, or something). You can't build a multiplier with
a multiplier (as opposed to multiplicand) of more than 8 bits that
runs faster than a full adder in the same technology (and this is a
carry-save multiplier- obviously, the full add at the end takes a
full-add time). And, your cycle time should not be much longer than a
full-add time; if it is longer, you have an extremely inefficient
design (I'm willing to make 'much longer' be up to a factor of two--
I'm easy), since all the common operations run slower. Even a 'fast'
multiply is a multi-cycle operation with these rules. You can, of
course, pipeline such a multiplier so you can start a new multiply
every cycle. You might even be able to build a full Wallace tree, but
the 'fast' multiplier is likely to exceed the size of the rest of the
processor. Note that all these arguments apply only to general
purpose processors.

Using these guidelines (which certainly could be questioned, but...),
I think the latter stages of the divide iteration will be multi-cycle
operations, or that the cycles will be much longer than they would be
otherwise. This is a little generous, in that I assume that the early
iterations are faster than the later ones, since they need to
multiply fewer bits. This takes a bit of control logic to implement,
and possibly some input multiplexing to the multipler. On the other
hand, each multiply iteration does need a full carry-propagate adder at its
end, so the divide step must be a lot longer than a simple add.

By the same kinds of reasoning, I don't think that a normalize is a one
cycle operation (for the kinds of short cycles you OUGHT to have). Not only
do you essentially need a 'count leading zeroes' function, but the output
of this has to feed a shifter. It can be done as a five step process
(are the first 16 bits=0? If so <<16. Now, are the first 8bits=0? .....).
This is getting perilously close to full-add speeds. Of course, I suppose
you might have a slow adder....

In an earlier posting, the statement was made that replacing a
microcode sequence with assembly language increases the number of
instructions that must be fetched.

I suppose that if you have microcode that is sufficiently tailored to
a particular problem (i.e. emulating some assembly language), then it
would probably be shorter. For general sorts of operations (like
Loading, Storing, Branching, the kinds of things computers generally
do) this is not the case. If a fixed microcode sequence is used to
emulate some assembly language which implements some operation, than the
microcode sequence will likely take longer.

If, in fact, your microcode sequences are always shorter than assembly
language, then you should do all your programming in microcode and call
it assembly language. Just like the RISCs do.

Finally, I agree that microcode doesn't need to be in supervisor
state to turn off interrupts (as opposed to most assembly language
level machines), so microcode doesn't lose you anything. But the question is:
what does microcode GAIN you? I will, of course, concede that there are places
outside an operating system kernel that need to do it. I don't know that
I would concede that there are enough places outside a kernel that need it
badly enough to make any performance difference, and that's the bottom line.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385