Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!mit-eddie!genrad!decvax!decwrl!pyramid!voder!apple!baum
From: baum@apple.UUCP (Allen J. Baum)
Newsgroups: comp.arch
Subject: Re: Divides (Was RE: What should be in hardware but isn't)
Message-ID: <6370@apple.UUCP>
Date: Mon, 28-Sep-87 17:13:54 EDT
Article-I.D.: apple.6370
Posted: Mon Sep 28 17:13:54 1987
Date-Received: Wed, 30-Sep-87 00:49:01 EDT
References: <581@l.cc.purdue.edu> <18336@amdcad.AMD.COM> <582@l.cc.purdue.edu> <6336@apple.UUCP> <7460@steinmetz.steinmetz.UUCP>
Reply-To: baum@apple.UUCP (Allen Baum)
Organization: Apple Computer, Inc.
Lines: 109

--------
[]

In article <6336@apple.UUCP> I wrote:
 ... Most RISC architectures have a divide step instruction, which
 is precisely what underlying microcode would use ...

In article <7460@steinmetz.steinmetz.UUCP> oconnor@sunray.UUCP (Dennis Oconnor) writes:
>  Our RISC architecture here at GE has no divide-step or
>  multiply-step. We have a better way. More later.

Well, I never said all; I said most. I also know of machines that have neither.
Are there any references to the GE RISC architecture? It's a new one to me.

I also said: any hardware support in excess of this will inevitably slow
the basic cycle down (I've been through the exercise).

Yes, I was exaggerating a bit there. Logic which is not on a critical
path will not inevitably slow down the cycle time. But it might all
the same: extra loading, having to fit the logic in somewhere,
causing wires to lengthen... These are all second order effects, but
they exist, and they are real and measurable.

I said: Microcode, or nanocode, has to go through all the same
operations that assembly level code does.

>  Except fetching instructions, and operations resulting from
>  the need to handle interrupts or exceptions at arbitrary points
>  in the assembly code (microcode can lock excepts out till it completes)

Assembly language doesn't fetch instructions; hardware does it
automatically. Generally it's microcode that has to fetch the assembly
language instructions, and hardware that fetches the microcode. Is
there something to this analogy I'm missing? You're not clear.

Assembly language is perfectly capable of locking out interrupts. You
may have to be in supervisor state or something to do it, but every
machine I've ever seen with interrupts has a way to turn them off.
This is far more flexible than having only specific microcode routines, which
are locked into ROM, be able to turn off interrupts.

I said: It's VERY difficult to make fixed point division run faster than
a bit per cycle, without a LOT of hardware. By leaving out the
special purpose speedup stuff, you can afford to include some VERY
useful general purpose speedup stuff: More registers ... branch folding ...
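
For readers who have not met a divide step, here is a minimal sketch of
what "a bit per cycle" means. It is plain restoring division, not the
divide step of any particular RISC; the names divide_step and divide and
the default 32-bit width are assumptions made for illustration.

```python
def divide_step(rem, quo, divisor, width):
    """One divide step: shift the (rem, quo) register pair left by one
    bit, then trial-subtract the divisor from the partial remainder."""
    mask = (1 << width) - 1
    rem = (rem << 1) | (quo >> (width - 1))  # bring in the next dividend bit
    quo = (quo << 1) & mask
    if rem >= divisor:                       # the trial subtraction succeeds
        rem -= divisor
        quo |= 1                             # record a quotient bit of 1
    return rem, quo


def divide(dividend, divisor, width=32):
    """Unsigned division as `width` divide steps, one quotient bit each."""
    if divisor == 0:
        raise ZeroDivisionError("divisor is zero")
    rem, quo = 0, dividend & ((1 << width) - 1)
    for _ in range(width):
        rem, quo = divide_step(rem, quo, divisor, width)
    return quo, rem
```

Each call to divide_step stands for one cycle, so a 32-bit divide costs
32 steps whether the loop lives in microcode or in assembly code.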

>  This is not really true. If you have a fast multiplier ( which is 
>  a good idea for many applications ) you can do division very much
>  quicker than one cycle per bit, relatively easily, especially for
>  long word lengths. In fact, you can do division in something like
>
>   C + (multiply_latency * (Int_Round_up( log_base2( word_length )) - K))
>
>  where C and K are positive integer small constants dependent on
>  how you implement your algorithm. The technique to use is
>  Newton-Raphson iteration with a first-guess look-up table.
>  "The official divide algorithm of the IBM-360/95 and Cray-1 (I think :-)"
>  The additional hardware needed (besides a fast multiplier) is TWIT.

I am not unaware of the Newton-Raphson divide algorithm.  If you
think that the look-up table, or the fast multiplier, or the datapath
logic you have to put around the multiplier to do the Newton-Raphson
iteration is trivial, then you haven't designed one. I have. It
isn't.  It is also not necessarily faster than a bit-at-a-time
divider, depending on your fast multiply time, word length, look-up
table, etc. I've been told (by someone who was there) that although
Amdahl insisted on this approach in the 470, it was discovered
afterwards that it ran slower than the usual serial approach would
have.
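
To make the trade-off concrete, here is a rough software sketch of
Newton-Raphson division with a first-guess table. It is not the 360,
Cray, or Amdahl design. The 4-bit table index (SEED_BITS), the
fixed-point layout, the final fix-up loop, and the name nr_divide are
all assumptions chosen for illustration.

```python
import math

SEED_BITS = 4  # assumed index width of the first-guess table (2**4 entries)


def nr_divide(n, d, width=32):
    """Unsigned n // d through a Newton-Raphson reciprocal of d.

    Returns (quotient, remainder, multiplies_used). Assumes width > SEED_BITS.
    """
    if not 0 < d < (1 << width):
        raise ValueError("divisor must be in 1 .. 2**width - 1")
    frac = 2 * width                 # fixed-point fraction bits, with guard bits
    shift = width - d.bit_length()   # normalize so the top bit of d is set
    d_norm = d << shift              # D = d_norm / 2**width lies in [0.5, 1)
    d_fx = d_norm << (frac - width)

    # First guess: reciprocal of the midpoint of the interval selected by the
    # SEED_BITS bits below the leading one. Hardware would read it from a ROM;
    # it is computed here only to keep the sketch short.
    i = (d_norm >> (width - 1 - SEED_BITS)) & ((1 << SEED_BITS) - 1)
    x = (1 << (frac + SEED_BITS + 2)) // ((1 << (SEED_BITS + 1)) + 2 * i + 1)

    # Each pass roughly doubles the number of correct bits: x <- x * (2 - D*x).
    passes = math.ceil(math.log2((width + 2) / (SEED_BITS + 1)))
    multiplies = 0
    for _ in range(passes):
        t = (d_fx * x) >> frac
        x = (x * ((2 << frac) - t)) >> frac
        multiplies += 2

    q = (n * x) >> (frac + width - shift)  # n times the reciprocal, rescaled
    r = n - q * d
    multiplies += 2
    while r >= d:                          # fix up the last quotient bit
        q, r = q + 1, r - d
    while r < 0:
        q, r = q - 1, r + d
    return q, r, multiplies
```

With these assumptions the pass count is 2, 3, and 4 for 16, 32, and
64-bit words, so it grows with log2 of the word length as the quoted
formula says. Note, though, that each pass is two dependent multiplies,
and the table, the normalizing shifts, and the fix-up all sit around the
multiplier. Whether 8 multiply latencies beat 32 divide steps depends
entirely on how fast the multiplier is.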

>>> [quote from someone else about allowing registers
>>>  to be accessed as memory locations]
>>The original PDP-10 from DEC allowed ...  Registers were the first
>>16 locations in memory ... instructions could put into the registers
>>and executed from them ...
>
>Of course, the PDP10 didn't have to reorganize code, so it did
>not have to deal with memory-aliasing problems. 

The PDP-10 didn't have to reorganize code. Neither do RISC
architectures.  Neither do 370's. But you might be surprised to find
that on these machines, reorganizing the code to eliminate
pipeline interlocks will speed up your code. See "Coding guidelines for
Pipelined Processors" by Rymarczyk of IBM in the ASPLOS I Proceedings.
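
A toy model shows the effect. It is not taken from the Rymarczyk paper.
It assumes an in-order, single-issue pipeline in which a loaded value
becomes usable one cycle later than an ALU result; the function name
cycles and the register names are made up for the example.

```python
def cycles(program, load_delay=1):
    """Count cycles for an in-order, single-issue pipeline that
    interlocks (stalls) when an instruction reads a register whose
    load has not completed yet.

    Each instruction is (dest_register, source_registers, is_load).
    """
    ready = {}   # register -> first cycle at which its value is usable
    cycle = 0
    for dest, srcs, is_load in program:
        # Issue no earlier than the current cycle and no earlier than
        # the cycle in which every source operand is available.
        start = max([cycle] + [ready.get(s, 0) for s in srcs])
        ready[dest] = start + 1 + (load_delay if is_load else 0)
        cycle = start + 1
    return cycle


naive = [
    ("r1", ["sp"], True),    # load r1
    ("r2", ["r1"], False),   # use r1 at once: interlock
    ("r3", ["sp"], True),    # load r3
    ("r4", ["r3"], False),   # use r3 at once: interlock
]
reorganized = [naive[0], naive[2], naive[1], naive[3]]  # loads hoisted
```

In this model cycles(naive) is 6 and cycles(reorganized) is 4. Both
orderings compute the same thing; the hardware interlock keeps the first
one correct, and reorganizing merely removes the stalls.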

>>The ATT CRISP doesn't have any registers. But, by caching the top of
>>the local frame, references to locals are effectively turned into
>>register references, and you get register windows as well. You can
>>index into these 'registers', byte access them, and reference them
>>with short 5-bit fields in the instruction.
>
>  One of the NICE things about registers that are NOT accessible
>  as memory is that you can uniquely identify references to a register
>  based strictly on the bits in the instruction stream. This is
>  crucial to reorganization : you must know when registers are
>  modified as a limit on how uses of that register can be moved.
>  Memory-aliasing can be a difficult task, especially if post-
>  reorganization linking is supported. How does the CRISP
>  reorganizer address this issue ?
>
>  Simple reorganizers ( a contradiction ) deal with memory aliasing
>  by forcing serialization of loads with respect to stores. If
>  your registers are accessible as memory and you use this scheme
>  in your reorganization, you wind up serializing every instruction
>  with respect to stores. What's the cost of this ?

Any ATT people out there that can answer this?
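
In the meantime, the serialization cost being asked about can at least
be sketched. This models only the "simple reorganizer" described in the
quote, with no alias analysis at all. It says nothing about what the
CRISP tools actually do, and the instruction encoding and function names
are invented for the example.

```python
def must_stay_ordered(a, b, regs_in_memory):
    """True if a conservative reorganizer may not swap a and b.

    Each instruction is (kind, registers_read, registers_written),
    where kind is "load", "store", or "alu".
    """
    kind_a, reads_a, writes_a = a
    kind_b, reads_b, writes_b = b
    # Ordinary register dependences, visible from the instruction bits alone.
    if writes_a & (reads_b | writes_b) or reads_a & writes_b:
        return True
    # Memory: keep every store ordered against every other memory access.
    mem = ("load", "store")
    if kind_a in mem and kind_b in mem and "store" in (kind_a, kind_b):
        return True
    # If registers are also memory locations, a store might hit any of
    # them, so it conflicts with every instruction that touches a register.
    # (A fuller model would also order loads against register writes.)
    if regs_in_memory:
        if kind_a == "store" and (reads_b or writes_b):
            return True
        if kind_b == "store" and (reads_a or writes_a):
            return True
    return False


def constraints(block, regs_in_memory):
    """Number of instruction pairs whose order the reorganizer must keep."""
    return sum(
        must_stay_ordered(block[i], block[j], regs_in_memory)
        for i in range(len(block))
        for j in range(i + 1, len(block))
    )


block = [
    ("store", {"r1", "r2"}, set()),   # store r1 to the address in r2
    ("alu", {"r3", "r4"}, {"r5"}),
    ("alu", {"r6"}, {"r7"}),
    ("load", {"r8"}, {"r9"}),
]
```

For this block, constraints(block, False) is 1 and
constraints(block, True) is 3: once registers alias memory, the store
pins every instruction that follows it.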

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385