Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!mit-eddie!uw-beaver!cornell!batcomputer!pyramid!voder!apple!bcase
From: bcase@apple.UUCP (Brian Case)
Newsgroups: comp.arch
Subject: Re: Divides (Was RE: What should be in hardware but isn't)
Message-ID: <6358@apple.UUCP>
Date: Sat, 26-Sep-87 18:55:54 EDT
Article-I.D.: apple.6358
Posted: Sat Sep 26 18:55:54 1987
Date-Received: Sun, 27-Sep-87 11:46:20 EDT
References: <581@l.cc.purdue.edu> <18336@amdcad.AMD.COM> <582@l.cc.purdue.edu> <6336@apple.UUCP> <7460@steinmetz.steinmetz.UUCP>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc., Cupertino, USA
Lines: 55

In article <7460@steinmetz.steinmetz.UUCP> oconnor@sunray.UUCP (Dennis Oconnor) writes:
>( All elipses ... are mine, and indicate excluded text. DMOC )
>In article <6336@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:

>>... Its VERY difficult to make fixed point division run faster than
>> a bit per cycle, without a LOT of hardware. By leaving out the
>> special purpose speedup stuff, you can afford to include some VERY
>> useful general purpose speedup stuff: More registers ... branch folding ...
>
>  This is not really true. If you have a fast multiplier ( which is 
>  a good idea for many applications ) you can do division very much
>  quicker than one cycle per bit, relatively easily, especially for
>  long word lengths. In fact, you can do division in something like
>
>   C + (multiply_lateny * (Int_Round_up( log_base2( word_length )) - K)
>
>  where C and K are positive integer small constants dependant on
>  how you implement your algorithm. The technique to use is
>  Newton-Raphson iteration with a first-guess look-up table.
>  "The official divide algorithm of the IBM-360/95 and Cray-1 (I think :-)"
>  The additional hardware needed (besides a fast multiplier) is TWIT.

You are neglecting to realize that for many (most?) implementations,
a fast (i.e. parallel) multiplier *is* a LOT of hardware.  If you are
fitting a sufficiently big multiplier on your chip (I want to talk only
about single-chip implementations), then great, let's all use your
technology.  Yes, Newton-Raphson is used lots of places (e.g. Am29300
family, Am29027 floating-point accellerator that I know of personally).

>>The ATT CRISP doesn't have any registers. But, by caching the top of
>>the local frame, references to locals are effectively turned into
>>register references, and you get register windows as well.
>
>  One of the NICE things about registers that are NOT accessable
>  as memory is that you can uniquely identify references to a register
>  based strictly on the bits in the instruction stream. This is
>  crucial to reorganization : you must know when registers are
>  modified as a limit on how uses of that register can be moved.
>  Memory-aliasing can be a difficult task, especially if post-
>  reorganization linking is supported. How does the CRISP
>  reorganizer address this issue ?

Registers in the CRISP are *completely* transparent to software (well, I
bet there is some way in which they are not completely, but for the
purposes of this discussion, they are *completely*).  To a compiler,
reorganizer, linker, etc. a register reference is a memory reference.
It *does* behoove the compiling system components to know that memory
references with small offsets (less than 32 words away in the implementation
that I know of) from the Stack Pointer will run much faster (so that
local variables can be allocated to have fast access, etc.), but this
is not necessary.  Reorganizers have no artificial constraints placed
on them by the CRISP.  Personally, I think the implementation cost of
the CRISP model is too high, but the transparency acheived is wonderful 
(the register file can be expanded/shrunk at will, even a 0 size will
work just fine).