Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!mnetor!uunet!husc6!mit-eddie!uw-beaver!cornell!batcomputer!pyramid!voder!apple!bcase From: bcase@apple.UUCP (Brian Case) Newsgroups: comp.arch Subject: Re: Divides (Was RE: What should be in hardware but isn't) Message-ID: <6358@apple.UUCP> Date: Sat, 26-Sep-87 18:55:54 EDT Article-I.D.: apple.6358 Posted: Sat Sep 26 18:55:54 1987 Date-Received: Sun, 27-Sep-87 11:46:20 EDT References: <581@l.cc.purdue.edu> <18336@amdcad.AMD.COM> <582@l.cc.purdue.edu> <6336@apple.UUCP> <7460@steinmetz.steinmetz.UUCP> Reply-To: bcase@apple.UUCP (Brian Case) Organization: Apple Computer Inc., Cupertino, USA Lines: 55 In article <7460@steinmetz.steinmetz.UUCP> oconnor@sunray.UUCP (Dennis Oconnor) writes: >( All elipses ... are mine, and indicate excluded text. DMOC ) >In article <6336@apple.UUCP> baum@apple.UUCP (Allen Baum) writes: >>... Its VERY difficult to make fixed point division run faster than >> a bit per cycle, without a LOT of hardware. By leaving out the >> special purpose speedup stuff, you can afford to include some VERY >> useful general purpose speedup stuff: More registers ... branch folding ... > > This is not really true. If you have a fast multiplier ( which is > a good idea for many applications ) you can do division very much > quicker than one cycle per bit, relatively easily, especially for > long word lengths. In fact, you can do division in something like > > C + (multiply_lateny * (Int_Round_up( log_base2( word_length )) - K) > > where C and K are positive integer small constants dependant on > how you implement your algorithm. The technique to use is > Newton-Raphson iteration with a first-guess look-up table. > "The official divide algorithm of the IBM-360/95 and Cray-1 (I think :-)" > The additional hardware needed (besides a fast multiplier) is TWIT. You are neglecting to realize that for many (most?) implementations, a fast (i.e. parallel) multiplier *is* a LOT of hardware. If you are fitting a sufficiently big multiplier on your chip (I want to talk only about single-chip implementations), then great, let's all use your technology. Yes, Newton-Raphson is used lots of places (e.g. Am29300 family, Am29027 floating-point accellerator that I know of personally). >>The ATT CRISP doesn't have any registers. But, by caching the top of >>the local frame, references to locals are effectively turned into >>register references, and you get register windows as well. > > One of the NICE things about registers that are NOT accessable > as memory is that you can uniquely identify references to a register > based strictly on the bits in the instruction stream. This is > crucial to reorganization : you must know when registers are > modified as a limit on how uses of that register can be moved. > Memory-aliasing can be a difficult task, especially if post- > reorganization linking is supported. How does the CRISP > reorganizer address this issue ? Registers in the CRISP are *completely* transparent to software (well, I bet there is some way in which they are not completely, but for the purposes of this discussion, they are *completely*). To a compiler, reorganizer, linker, etc. a register reference is a memory reference. It *does* behoove the compiling system components to know that memory references with small offsets (less than 32 words away in the implementation that I know of) from the Stack Pointer will run much faster (so that local variables can be allocated to have fast access, etc.), but this is not necessary. Reorganizers have no artificial constraints placed on them by the CRISP. Personally, I think the implementation cost of the CRISP model is too high, but the transparency acheived is wonderful (the register file can be expanded/shrunk at will, even a 0 size will work just fine).