Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!steinmetz!sunray!oconnor
From: oconnor@sunray.steinmetz (Dennis Oconnor)
Newsgroups: comp.arch
Subject: Re: Divides (Was RE: What should be in hardware but isn't)
Message-ID: <7481@steinmetz.steinmetz.UUCP>
Date: Wed, 31-Dec-69 18:59:59 EDT
Article-I.D.: steinmet.7481
Posted: Wed Dec 31 18:59:59 1969
Date-Received: Thu, 1-Oct-87 03:29:13 EDT
References: <581@l.cc.purdue.edu> <18336@amdcad.AMD.COM> <582@l.cc.purdue.edu> <6336@apple.UUCP> <7460@steinmetz.steinmetz.UUCP> <6370@apple.UUCP>
Sender: root@steinmetz.steinmetz.UUCP
Reply-To: oconnor@sunray.UUCP (Dennis Oconnor)
Organization: General Electric CRD, Schenectady, NY
Lines: 121

In article <6370@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>Are there any references to the GE RISC architecture? It's a new one to me.

  There is a paper in the upcoming GOMAC which is the first general
  disclosure. The system had a blurb in AW&ST about 18 months ago.

>I said: Microcode, or nanocode, has to go through all the same
>operations that assembly level code does.
>
>>  Except fetching instructions, and operations resulting from
>>  the need to handle interrupts or exceptions at arbitrary points
>>  in the assembly code (microcode can lock exceptions out till it completes)
>
>Assembly language doesn't fetch instructions, hardware does it
>automatically. Generally it's microcode that has to fetch the assembly
>language instructions, and hardware that fetches the microcode. Is
>there something to this analogy I'm missing? You're not clear.

  Replacing a microcode sequence with an assembly language sequence
  increases the number of instructions that need to be fetched to
  perform the operation. Good cache design can minimize this penalty.

>Assembly language is perfectly capable of locking out interrupts. You
>may have to be in supervisor state or something to do it, but every
>machine I've ever seen with interrupts has a way to turn them off.
>This is far more flexible than having only specific microcode routines, which
>are locked into ROM, be able to turn off interrupts.

  Microcode doesn't need to be in supervisor state to temporarily
  disable interrupts, doesn't need the user to know interrupts need
  disabling, and doesn't prevent you from providing an interrupt-disable
  to the user. So what does microcode lose you in this respect?

  This is NOT a CISC/RISC argument : I favor RISC, but let's not
  kid ourselves that CISCs have no advantages; we all know they do.
  It's like comparing trains to trucks : there are wins for each.

>I said: It's VERY difficult to make fixed point division run faster than
>a bit per cycle, without a LOT of hardware. By leaving out the
>special purpose speedup stuff, you can afford to include some VERY
>useful general purpose speedup stuff: More registers ... branch folding ...
>
>>  This is not really true. If you have a fast multiplier ( which is 
>>  a good idea for many applications ) you can do division very much
>>  quicker than one cycle per bit, relatively easily, especially for
>>  long word lengths. In fact, you can do division in something like
>>
>>   C + (multiply_latency * (Int_Round_up( log_base2( word_length )) - K))
>>
>>  where C and K are small positive integer constants dependent on
>>  how you implement your algorithm. The technique to use is
>>  Newton-Raphson iteration with a first-guess look-up table.
>>  "The official divide algorithm of the IBM-360/95 and Cray-1 (I think :-)"
>>  The additional hardware needed (besides a fast multiplier) is TWIT.
>
>I am not unaware of the Newton-Raphson divide algorithm.  If you
>think that the look-up table, or the fast multiplier, or the datapath
>logic you have to put around the multiplier to do the Newton-Raphson
>iteration is trivial, then you haven't designed one. I have. It
>isn't.  It is also not necessarily faster than a bit at a time
>divider, depending on your fast multiply time, word length, look-up
>table, etc. I've been told (by someone who was there) that although
>Amdahl insisted on this approach in the 470, it was discovered
>afterwards that it ran slower than the usual serial approach would have.

I specifically stated that IF you had a fast multiplier around,
this was easy, so saying fast multipliers are non-trivial doesn't
contradict me. Now, I HAVE designed everything you say. Really.
It has the look-up table in random logic ( only need a few bits ).
Has a clever normalize-denominator instruction (needed).
Does the iteration in assembly language (it's a RISC machine)
so no additional datapath is required. And will be in silicon RSN.
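
For readers who have not seen the technique, here is a minimal sketch of
Newton-Raphson division with a first-guess look-up table. It is NOT the GE
design (those details are not public yet). It assumes a 32-bit unsigned
divide, an 8-entry table indexed by the 3 bits below the leading one of the
normalized denominator, and Python integers standing in for the multiplier.
The names nr_divide, RECIP_TABLE, WORD and FRAC are made up for illustration.

```python
WORD = 32   # assumed word length in bits
FRAC = 32   # assumed number of fraction bits kept in the reciprocal estimate

# First-guess table (hypothetical): the normalized denominator D lies in
# [0.5, 1). Split that range into 8 intervals and store 1/midpoint for each,
# scaled by 2**FRAC. The worst-case relative error is a bit under 2**-4.
RECIP_TABLE = [round((1 << FRAC) * 16 / (8.5 + i)) for i in range(8)]


def nr_divide(n, d):
    """Unsigned divide of n by d (both below 2**WORD); returns (quotient, remainder)."""
    if d == 0:
        raise ZeroDivisionError("division by zero")

    # Normalize the denominator so its top bit is set: D = dn / 2**WORD.
    shift = WORD - d.bit_length()
    dn = d << shift

    # First guess x ~= 1/D from the 3 bits just below the leading one.
    x = RECIP_TABLE[(dn >> (WORD - 4)) & 7]

    # Newton-Raphson step x <- x * (2 - D * x). Each step costs two
    # multiplies and roughly doubles the number of correct bits.
    good_bits = 4
    while good_bits < WORD:
        t = (2 << FRAC) - ((dn * x) >> WORD)
        x = (x * t) >> FRAC
        good_bits *= 2

    # Quotient estimate n * (1/d), undoing the normalization shift.
    q = (n * x) >> (FRAC + WORD - shift)

    # The estimate can be slightly off because of truncation, so fix it up.
    r = n - q * d
    while r >= d:
        q += 1
        r -= d
    while r < 0:
        q -= 1
        r += d
    return q, r
```

With these assumptions nr_divide(100, 7) returns (14, 2). Starting from
about 4 good bits, a 32-bit result takes three passes through the loop,
which is the log2(word_length) minus a constant behaviour in the formula
quoted above; a better table buys you a bigger K.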

I hate "you haven't designed one, I have" arguments. Well, let's
make it clear : I (and the rest of the GE team) HAVE DESIGNED A
RISC PROCESSOR WITH NR-ITERATION FOR DIVISION. Details will have
to wait for the official public release. But since YOU have designed
one (I guess) and I have designed one, and we disagree, then
we must have been designing in different contexts (i.e. speed targets,
architecture limits, implementation technology ... ). You show me
your design and context and I'll show you mine (:-).

( side note : don't you think some designers get a raw deal? I mean,
here is some world-class designer at Intel, and he has to design
a new competitive processor that's upward-compatible with the 8086.
Will the CS world look at his design and say, "Wow, what a great
job he did given the constraints he was under!"? No, most of the
CS world will just say "What a Kludge!". We're talkin unfair here. )

I never said NR-Iteration was faster than the serial method, I simply
gave an equation for predicting how fast it was. Given an equation for
serial division, like :     C2 + (subtract_latency * word_length),
it's relatively easy to see that sometimes one is faster, sometimes
the other is. No surprise. For us, NR-Iteration was the winner. Your
mileage may vary. But do I care how fast you do an 8-bit divide? No.
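
To make the comparison concrete, the two cycle-count estimates can be put
side by side. This is only a sketch of the two formulas as stated in this
thread; the function names and every constant plugged in below are invented
for illustration and are not measurements from any real machine.

```python
def ceil_log2(word_length):
    """Int_Round_up(log_base2(word_length)) using integer arithmetic only."""
    return (word_length - 1).bit_length()


def nr_cycles(word_length, multiply_latency, c, k):
    """Estimate for Newton-Raphson: C + multiply_latency * (ceil(log2(W)) - K)."""
    return c + multiply_latency * (ceil_log2(word_length) - k)


def serial_cycles(word_length, subtract_latency, c2):
    """Estimate for bit-serial division: C2 + subtract_latency * W."""
    return c2 + subtract_latency * word_length


if __name__ == "__main__":
    # Made-up parameters: 5-cycle multiply, 1-cycle subtract, C=6, K=2, C2=2.
    for w in (8, 32, 64):
        print(w, nr_cycles(w, 5, 6, 2), serial_cycles(w, 1, 2))
```

With those invented numbers the serial method wins at 8 bits (10 cycles
against 11) and NR-Iteration wins at 32 bits (21 against 34) and at 64 bits
(26 against 66). Change the multiply latency or the constants and the
crossover moves, which is the whole point.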

I said in an article:
>>Of course, the PDP10 didn't have to reorganize code, so it did
>>not have to deal with memory-aliasing problems. 

In article <6370@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>The PDP-10 didn't have to re-organize code. Neither do RISC
>architectures.  Neither do 370's. But, you might be surprised to find
>that on these machines, re-organizing the code to eliminate
>pipeline interlocks will speed up your code. [Reference deleted]
>{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

Sure, RISCs don't need to reorganize, and CRAYs don't need hard disks,
floppies will work just fine! Semantic Baloney. If by "needs" we
mean "must have in order to be commercially viable", or "must have
to show any advantage over other architecture classes", then RISCs
do need reorganizers. I never said other classes of architecture
(like CISCs) couldn't benefit from them, but they don't exhibit
as large a performance penalty when you don't use one as RISCs do.

And before you say "You haven't written a reorganizer.", well,
it's not finished, but I am on a team writing one NOW. (:-)
--
	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
        "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"