Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!steinmetz!sunray!oconnor
From: oconnor@sunray.steinmetz (Dennis Oconnor)
Newsgroups: comp.arch
Subject: Re: Divides (Was RE: What should be in hardware but isn't)
Message-ID: <7481@steinmetz.steinmetz.UUCP>
Date: Wed, 31-Dec-69 18:59:59 EDT
Article-I.D.: steinmet.7481
Posted: Wed Dec 31 18:59:59 1969
Date-Received: Thu, 1-Oct-87 03:29:13 EDT
References: <581@l.cc.purdue.edu> <18336@amdcad.AMD.COM> <582@l.cc.purdue.edu> <6336@apple.UUCP> <7460@steinmetz.steinmetz.UUCP> <6370@apple.UUCP>
Sender: root@steinmetz.steinmetz.UUCP
Reply-To: oconnor@sunray.UUCP (Dennis Oconnor)
Organization: General Electric CRD, Schenectady, NY
Lines: 121

In article <6370@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>Are there any references to the GE RISC architecture? It's a new one to me.

There is a paper in the upcoming GOMAC which is the first general
disclosure. The system had a blurb in AW&ST about 18 months ago.

>I said: Microcode, or nanocode, has to go through all the same
>operations that assembly level code does.
>
>> Except fetching instructions, and operations resulting from
>> the need to handle interrupts or exceptions at arbitrary points
>> in the assembly code (microcode can lock exceptions out till it completes)
>
>Assembly language doesn't fetch instructions, hardware does it
>automatically. Generally it's microcode that has to fetch the assembly
>language instructions, and hardware that fetches the microcode. Is
>there something to this analogy I'm missing?

You're not clear. Replacing a microcode sequence with an assembly
language sequence increases the number of instructions that need to
be fetched to perform the operation. Good cache design can minimize
this penalty.

>Assembly language is perfectly capable of locking out interrupts. You
>may have to be in supervisor state or something to do it, but every
>machine I've ever seen with interrupts has a way to turn them off.
>This is far more flexible than having only specific microcode routines, which
>are locked into ROM, be able to turn off interrupts.

Microcode doesn't need to be in supervisor state to temporarily disable
interrupts, doesn't need the user to know interrupts need disabling,
and doesn't prevent you from providing an interrupt-disable to the user.
So what does microcode lose you in this respect? This is NOT a
CISC/RISC argument: I favor RISC, but let's not kid ourselves that
CISCs don't have some advantages. As we all know. That's like comparing
trains to trucks: there are wins for each.

>I said: It's VERY difficult to make fixed point division run faster than
>a bit per cycle, without a LOT of hardware. By leaving out the
>special purpose speedup stuff, you can afford to include some VERY
>useful general purpose speedup stuff: More registers ... branch folding ...
>
>> This is not really true. If you have a fast multiplier ( which is
>> a good idea for many applications ) you can do division very much
>> quicker than one cycle per bit, relatively easily, especially for
>> long word lengths. In fact, you can do division in something like
>>
>>   C + (multiply_latency * (Int_Round_up( log_base2( word_length )) - K))
>>
>> where C and K are small positive integer constants dependent on
>> how you implement your algorithm. The technique to use is
>> Newton-Raphson iteration with a first-guess look-up table.
>> "The official divide algorithm of the IBM-360/95 and Cray-1 (I think :-)"
>> The additional hardware needed (besides a fast multiplier) is TWIT.
>
>I am not unaware of the Newton-Raphson divide algorithm. If you
>think that the look-up table, or the fast multiplier, or the datapath
>logic you have to put around the multiplier to do the Newton-Raphson
>iteration is trivial, then you haven't designed one. I have. It
>isn't. It is also not necessarily faster than a bit at a time
>divider, depending on your fast multiply time, word length, look-up
>table, etc.
>I've been told (by someone who was there) that although
>Amdahl insisted on this approach in the 470, it was discovered
>afterwards that it ran slower than the usual serial approach would have.

I specifically stated that IF you had a fast multiplier around, this
was easy, so saying fast multipliers are non-trivial doesn't contradict
me. Now, I HAVE designed everything you say. Really. It has the
look-up table in random logic ( only need a few bits ). Has a
clever normalize-denominator instruction (needed). Does the iteration
in assembly language (it's a RISC machine) so no additional datapath
is required. And will be in silicon RSN.

I hate "you haven't designed one, I have" arguments. Well, let's make
it clear: I (and the rest of the GE team) HAVE DESIGNED A RISC PROCESSOR
WITH NR-ITERATION FOR DIVISION. Details will have to wait for the
official public release. But since YOU have designed one (I guess)
and I have designed one, and we disagree, then we must have been
designing in different contexts (i.e. speed targets, architecture
limits, implementation technology ... ). You show me your design and
context and I'll show you mine (:-).

( side note: don't you think some designers get a raw deal? I mean,
here is some world-class designer at Intel, and he has to design a
new competitive processor that's upward-compatible with the 8086. Will
the CS world look at his design and say, "Wow, what a great job he
did given the constraints he was under!"? No, most of the CS world
will just say "What a Kludge!". We're talkin' unfair here. )

I never said NR-Iteration was faster than the serial method, I simply
gave an equation for predicting how fast it was. Given an equation
for serial division, like:

  C2 + (subtract_latency * word_length),

it's relatively easy to see that sometimes one is faster, sometimes the
other is. No surprise. For us, NR-Iteration was the winner. Your mileage
may vary. But do I care how fast you do an 8-bit divide? No.
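
For anyone who wants to play with the trade-off, here is a minimal
sketch of Newton-Raphson division with a first-guess table, plus the
two cycle-count estimates. It is NOT the GE design, whose details are
not given here. It uses floating point for brevity, where real hardware
would use a random-logic table and a normalize instruction. TABLE_BITS,
RECIP_TABLE, nr_divide, nr_cycles, serial_cycles and every constant in
it are made up for illustration.

```python
import math

TABLE_BITS = 4  # hypothetical size: 2**4 first-guess entries

# First guesses at 1/m for a normalized divisor m in [0.5, 1),
# one entry per bucket, taken at the bucket midpoint.
RECIP_TABLE = [1.0 / (0.5 + (i + 0.5) / 2 ** (TABLE_BITS + 1))
               for i in range(2 ** TABLE_BITS)]


def nr_divide(n, d, word_length=32):
    """Return (n // d, iterations) for unsigned operands of word_length bits.

    Floats stand in for the multiplier datapath, so this sketch is
    limited to word lengths of 32 bits or less.
    """
    if word_length > 32 or n < 0 or d <= 0 or max(n, d) >= 2 ** word_length:
        raise ValueError("operands out of range for this sketch")
    m, e = math.frexp(d)  # normalize: d == m * 2**e with 0.5 <= m < 1
    x = RECIP_TABLE[int((m - 0.5) * 2 ** (TABLE_BITS + 1))]
    good_bits, iterations = TABLE_BITS, 0  # conservative accuracy estimate
    while good_bits < word_length:
        x = x * (2.0 - m * x)  # Newton-Raphson step for f(x) = 1/x - m
        good_bits *= 2         # each step roughly doubles the good bits
        iterations += 1
    q = int(math.ldexp(n * x, -e))  # n * (1/m) / 2**e
    # The reciprocal is only approximate, so fix up the last unit.
    while q * d > n:
        q -= 1
    while (q + 1) * d <= n:
        q += 1
    return q, iterations


def nr_cycles(word_length, multiply_latency, c, k):
    """C + (multiply_latency * (Int_Round_up(log_base2(word_length)) - K))."""
    return c + multiply_latency * (math.ceil(math.log2(word_length)) - k)


def serial_cycles(word_length, subtract_latency, c2):
    """C2 + (subtract_latency * word_length)."""
    return c2 + subtract_latency * word_length


if __name__ == "__main__":
    print(nr_divide(100, 7))                                   # (14, 3)
    print(nr_cycles(32, 4, 6, 2), serial_cycles(32, 1, 2))     # 18 34
    print(nr_cycles(8, 4, 6, 2), serial_cycles(8, 1, 2))       # 10 10
```

With these invented constants the 32-bit case comes out at 18 cycles
against 34, while the 8-bit case is a tie at 10. Change the multiply
latency or the constants and the winner changes, which is the point
of having both equations.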

I said in an article:
>>Of course, the PDP10 didn't have to reorganize code, so it did
>>not have to deal with memory-aliasing problems.

In article <6370@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>The PDP-10 didn't have to re-organize code. Neither do RISC
>architectures. Neither do 370's. But, you might be surprised to find
>that on these machines, re-organizing the code to eliminate
>pipeline interlocks will speed up your code. [Reference deleted]
>{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385

Sure, RISCs don't need to reorganize, and CRAYs don't need hard disks,
floppies will work just fine! Semantic Baloney. If by "needs" we mean
"must have in order to be commercially viable", or "must have to show
any advantage over other architecture classes", then RISCs do need
reorganizers. I never said other classes of architecture (like CISCs)
couldn't benefit from them, but they don't exhibit as large a
performance penalty when you don't use one as RISCs do.

And before you say "You haven't written a reorganizer.", well, it's
not finished, but I am on a team writing one NOW. (:-)
--
Dennis O'Connor   oconnor@sungoddess.steinmetz.UUCP ??
ARPA: OCONNORDM@ge-crd.arpa
"If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"