Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!steinmetz!sunray!oconnor
From: oconnor@sunray.steinmetz (Dennis Oconnor)
Newsgroups: comp.arch
Subject: Divides (Was RE: What should be in hardware but isn't)
Message-ID: <7460@steinmetz.steinmetz.UUCP>
Date: Fri, 25-Sep-87 09:54:13 EDT
Article-I.D.: steinmet.7460
Posted: Fri Sep 25 09:54:13 1987
Date-Received: Sun, 27-Sep-87 02:33:43 EDT
References: <581@l.cc.purdue.edu> <18336@amdcad.AMD.COM> <582@l.cc.purdue.edu> <6336@apple.UUCP>
Sender: root@steinmetz.steinmetz.UUCP
Reply-To: oconnor@sunray.UUCP (Dennis Oconnor)
Organization: General Electric CRD, Schenectady, NY
Lines: 80

( All ellipses ... are mine, and indicate excluded text. DMOC )
In article <6336@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
> ... Most RISC architectures have a divide step instruction, which
> is precisely what underlying microcode would use ...

  Our RISC architecture here at GE has no divide-step or
  multiply-step. We have a better way. More later.

> ... any hardware support in excess of this will inevitably slow
> the basic cycle down (I've been through the exercise).

  No, this is not true. Cycle time is generally dependent on
  some set of critical paths. Hardware that does not interact
  with these critical paths has no effect, unless it creates new
  critical paths. Where your critical paths lie depends heavily
  on implementation technology : could be the ALU, or the
  register file, or the instruction decode ...
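
  A minimal sketch of that argument, with made-up path delays (the
  names and numbers below are assumptions, chosen only for
  illustration): cycle time is the slowest path, so added hardware
  matters only if its own path becomes the slowest one.

```python
def cycle_time(path_delays_ns):
    """Cycle time is set by the slowest path through the machine."""
    return max(path_delays_ns.values())

# Hypothetical path delays in nanoseconds, chosen only for illustration.
base = {"alu": 38.0, "register_file": 40.0, "decode": 35.0}

# Extra hardware whose own path is shorter than the existing critical path.
with_fast_unit = dict(base, divide_support=30.0)

# Extra hardware that creates a new, longer critical path.
with_slow_unit = dict(base, divide_support=45.0)
```

  Here cycle_time(base) and cycle_time(with_fast_unit) are the same;
  only with_slow_unit stretches the cycle.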

> ... Microcode, or nanocode, has to go through all the same
> operations that assembly level code does. 

  Except fetching instructions, and operations resulting from
  the need to handle interrupts or exceptions at arbitrary points
  in the assembly code (microcode can lock exceptions out until it
  completes).

>... Its VERY difficult to make fixed point division run faster than
> a bit per cycle, without a LOT of hardware. By leaving out the
> special purpose speedup stuff, you can afford to include some VERY
> useful general purpose speedup stuff: More registers ... branch folding ...

  This is not really true. If you have a fast multiplier (which is
  a good idea for many applications) you can do division much
  faster than one cycle per bit, relatively easily, especially for
  long word lengths. In fact, you can do division in something like

   C + multiply_latency * (Int_Round_up(log_base2(word_length)) - K)

  where C and K are small positive integer constants dependent on
  how you implement your algorithm. The technique to use is
  Newton-Raphson iteration with a first-guess look-up table.
  "The official divide algorithm of the IBM-360/95 and Cray-1 (I think :-)"
  The additional hardware needed (besides a fast multiplier) is TWIT.
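
  As a rough sketch of the technique (not the GE implementation, which
  is not described here): the code below does unsigned division with
  a first-guess reciprocal table followed by Newton-Raphson steps
  x <- x * (2 - d * x). WORD, TABLE_BITS, FRAC, TABLE and nr_divide
  are all names and sizes assumed for illustration, and Python
  integers stand in for the multiplier hardware.

```python
import math

WORD = 32        # word length in bits (assumption)
TABLE_BITS = 4   # index bits into the first-guess table (assumption)
FRAC = 2 * WORD  # fractional bits kept in intermediates (assumption)

# The table guess is good to about TABLE_BITS + 1 bits, and each Newton
# step roughly doubles that, so the step count grows as log2(WORD).
NEWTON_STEPS = math.ceil(math.log2(WORD / (TABLE_BITS + 1)))

# First guess: reciprocal of the midpoint of each slice of [0.5, 1).
TABLE = [
    round((1 << FRAC) / (0.5 + (i + 0.5) / (2 << TABLE_BITS)))
    for i in range(1 << TABLE_BITS)
]

def nr_divide(n, d):
    """Unsigned floor(n / d) for 0 <= n < 2**WORD and 0 < d < 2**WORD."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    length = d.bit_length()
    d_fix = d << (FRAC - length)             # d scaled into [0.5, 1)
    index = (d_fix >> (FRAC - 1 - TABLE_BITS)) & ((1 << TABLE_BITS) - 1)
    x = TABLE[index]                         # x approximates 1 / d_scaled
    for _ in range(NEWTON_STEPS):
        t = (d_fix * x) >> FRAC              # multiply 1: d * x, close to 1.0
        x = (x * ((2 << FRAC) - t)) >> FRAC  # multiply 2: x * (2 - d * x)
    q = (n * x) >> (FRAC + length)           # final multiply: n * (1 / d)
    # Truncation can leave q off by one; fix it up with the remainder.
    r = n - q * d
    while r < 0:
        q, r = q - 1, r + d
    while r >= d:
        q, r = q + 1, r - d
    return q
```

  With these assumed sizes the table is good to about 5 bits, so
  NEWTON_STEPS comes out as 3 for a 32-bit word; doubling the word
  length adds one more step, which is where the log_base2 term in the
  formula comes from.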

>> [quote from someone else about allowing registers
>>  to be accessed as memory locations]
>The original PDP-10 from DEC allowed ...  Registers were the first
>16 locations in memory ... instructions could put into the registers
>and executed from them ...

  Of course, the PDP-10 didn't have to reorganize code, so it did
  not have to deal with memory-aliasing problems.

>The ATT CRISP doesn't have any registers. But, by caching the top of
>the local frame, references to locals are effectively turned into
>register references, and you get register windows as well. You can
>index into these 'registers', byte access them, and reference them
>with short 5-bit fields in the instruction.

  One of the NICE things about registers that are NOT accessible
  as memory is that you can uniquely identify references to a register
  based strictly on the bits in the instruction stream. This is
  crucial to reorganization : you must know when registers are
  modified, since that limits how far uses of that register can be
  moved. Resolving memory aliasing can be a difficult task,
  especially if post-reorganization linking is supported. How does
  the CRISP reorganizer address this issue ?

  Simple reorganizers ( a contradiction in terms ) deal with memory
  aliasing by forcing serialization of loads with respect to stores.
  If your registers are accessible as memory and you use this scheme
  in your reorganization, you wind up serializing every instruction
  with respect to stores. What's the cost of this ?
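
  To make the question concrete, here is a small sketch of such a
  conservative rule. The Instr record, the register names and the
  five-instruction sequence are all hypothetical; this is not how the
  CRISP reorganizer or ours works, just a way to count how many
  instruction pairs get pinned in order.

```python
from collections import namedtuple

# Hypothetical instruction record: kind is "alu", "load" or "store".
Instr = namedtuple("Instr", "kind dest srcs")

def reg_conflict(a, b):
    """True if b must stay after a because of register use (RAW, WAR, WAW)."""
    raw = a.dest is not None and a.dest in b.srcs
    war = b.dest is not None and b.dest in a.srcs
    waw = a.dest is not None and a.dest == b.dest
    return raw or war or waw

def mem_conflict(a, b, regs_are_memory):
    """Conservative rule: nothing that may touch memory crosses a store."""
    if regs_are_memory:
        # Every instruction uses a register, so every one may touch memory.
        return a.kind == "store" or b.kind == "store"
    both_touch = a.kind != "alu" and b.kind != "alu"
    return both_touch and "store" in (a.kind, b.kind)

def ordered_pairs(code, regs_are_memory):
    """Count instruction pairs the reorganizer may not swap."""
    return sum(
        1
        for i, a in enumerate(code)
        for b in code[i + 1:]
        if reg_conflict(a, b) or mem_conflict(a, b, regs_are_memory)
    )

code = [
    Instr("load",  "r1", ("r10",)),       # r1 <- mem[r10]
    Instr("alu",   "r2", ("r3", "r4")),   # r2 <- r3 + r4
    Instr("store", None, ("r5", "r11")),  # mem[r11] <- r5
    Instr("alu",   "r6", ("r7", "r8")),   # r6 <- r7 * r8
    Instr("load",  "r9", ("r12",)),       # r9 <- mem[r12]
]
```

  For this sequence ordered_pairs(code, False) is 2 (only the two
  loads are pinned against the store), while ordered_pairs(code, True)
  is 4, because the two unrelated ALU operations can no longer move
  past the store either.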

>{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385


--
	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
        "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"