Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!oakhill!marvin
From: marvin@oakhill.UUCP (Marvin Denman)
Newsgroups: comp.arch
Subject: Re: Integer multiply and killer micros
Message-ID: <2810@yogi.oakhill.UUCP>
Date: 10 Jan 90 23:52:40 GMT
References: <158@csinc.UUCP> <787@stat.fsu.edu> <42701@lll-winken.LLNL.GOV> <5842@ncar.ucar.edu> <490@qusunl.queensu.CA> <34259@mips.mips.COM>
Reply-To: cs.utexas.edu!oakhill!marvin (Marvin Denman)
Organization: Motorola Inc., Austin, Texas
Lines: 49

In article (Michael Meissner) writes:
>In article <34259@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>|    R6000:  16 cycles   32x32 -> 64
>|    R3000:  12 cycles   32x32 -> 64
>|    M88000:  4 cycles   32x32 -> 32 **
>|
>|    **88k computes the 32 lsb's of the 64b product (upper bits are discarded).
>
>Actually if I remember chapter 7 of the 88100 user's manual, a
>multiply 6 cycles (1 in FP1, 3 in the multiplier stage, 1 in FPLAST,
>and 1 writeback).  Logically, the writeback phase should be available
>to be feed forward, which logically shaves off 1 cycle.  However,
>since non of the floating point operations do feed forwarding, I
>wouldn't be surprised if integer multiply/divide don't feed forward
>either.  As alluded to in an earlier article, multiple multiplications
>can be done in parallel, since each cycle, the multiplier advances the
>pipeline.  Floating point adds can similarly be pipelined.

I think you may have a slight parity error in your memory.  Integer
multiply on the 88100 does take 4 cycles of latency, just as Mark Johnson
stated.  Single precision floating point multiply, however, takes 6 cycles
of latency.

Your comments about feed forward confuse me slightly.  The latency numbers
that Motorola quotes are all in terms of when another instruction can use
the result, so they already account for feed forward.
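
To put rough numbers on what those latencies mean for dependent versus
independent multiplies, here is a small sketch in C.  It is only a
back-of-the-envelope model, not anything from the 88100 manual: it assumes
an idealised unit that can start one operation per clock, that a dependent
operation issues the clock its operand becomes usable, and that nothing
else stalls.  The function names are made up for illustration.

```c
#include <assert.h>

/* Latencies quoted in this thread, in clocks. */
enum { INT_MUL_LATENCY = 4, SP_FMUL_LATENCY = 6 };

/*
 * Clocks to finish n multiplies when each one needs the previous result.
 * Nothing overlaps, so the full latency is paid every time.
 * (Assumption: the next multiply issues as soon as its operand is usable.)
 */
unsigned dependent_chain_cycles(unsigned n, unsigned latency)
{
    assert(latency > 0);
    return n * latency;
}

/*
 * Clocks to finish n independent multiplies in a fully pipelined unit.
 * (Assumption: one new multiply can issue every clock.)
 * The first result appears after 'latency' clocks, then one per clock.
 */
unsigned pipelined_cycles(unsigned n, unsigned latency)
{
    assert(latency > 0);
    return n == 0 ? 0 : latency + (n - 1);
}
```

Under these assumptions, pipelined_cycles(8, INT_MUL_LATENCY) gives 11
clocks for eight independent integer multiplies, against 32 from
dependent_chain_cycles(8, INT_MUL_LATENCY) when each one waits on the last.
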
Perhaps the source of your confusion is that double precision results have
one extra clock of latency when feeding forward into store double
instructions, because of implementation considerations.  Double precision
results do feed forward with no delay to other double precision operations.

A key point of the 88100 design was its pipelining.  Single precision
multiply and add, along with integer multiply, are fully pipelined.  Double
precision operations are pipelined except for stalling for one clock on
either end of the pipe.  This allows very high throughput on well-scheduled
code.

Currently compilers do very little code scheduling for floating point (at
least in the code I have looked at).  When compilers begin to take advantage
of the 88100 floating point pipelines there will be a jump in performance.
The current compilers are already making progress, which will show in the
newest release of the SPEC benchmark numbers when they are published.  There
is still much room for improvement.

Marvin Denman
Motorola 88000 Design
--
Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin