Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!pt.cs.cmu.edu!andrew.cmu.edu!+
From: Daniel.Stodolsky@cs.cmu.edu
Newsgroups: comp.arch
Subject: Integer multiply and killer micros
Message-ID:
Date: 4 Jan 90 16:45:53 GMT
References: <158@csinc.UUCP> <787@stat.fsu.edu> <42701@lll-winken.LLNL.GOV> <5842@ncar.ucar.edu>, <490@qusunl.queensu.CA>
Organization: Carnegie Mellon, Pittsburgh, PA
Lines: 20
In-Reply-To: <490@qusunl.queensu.CA>

If my memory serves me correctly, one should be able to compute a
32 x 32 -> 64 bit multiply with four 16 x 16 -> 32 bit multiplies,
four 32-bit adds and a few shifts.

So why not put some of the big-memory killer micros to work and have a
16 by 16 multiply lookup table? It would consume 10 megs of core, but
that's nothing for a KILLER MICRO.

Assuming a memory access with a cache miss takes around 3 cycles and
one can schedule to avoid register interlocks (as in HP-PA), it seems
possible to do 32 x 32 -> 64 (results in registers) in about 20 cycles.

This huge table could be made available as a shared read-only data
segment, so every process wouldn't need its own copy.

Comments?

Daniel Stodolsky
Engineering Design Research Center
Carnegie Mellon University
danner@cs.cmu.edu
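
For anyone who wants to check the decomposition, here is a minimal sketch in Python, assuming unsigned operands. The names mul16 and mul32x32_64 are hypothetical, made up for this illustration. mul16 stands in for the proposed table lookup: a table indexed by both 16-bit operands would have 2^32 entries, so the sketch uses the built-in multiply in its place. Propagating the carry out of the middle 16-bit column costs an add or two beyond the count quoted from memory above.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul16(a, b):
    """Hypothetical stand-in for the table lookup: 16 x 16 -> 32 bit product."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def mul32x32_64(x, y):
    """Unsigned 32 x 32 -> 64 bit multiply; returns (high, low) 32-bit words."""
    assert 0 <= x <= MASK32 and 0 <= y <= MASK32

    # Split each operand into 16-bit halves.
    xl, xh = x & MASK16, x >> 16
    yl, yh = y & MASK16, y >> 16

    # The four 16 x 16 -> 32 partial products.
    ll = mul16(xl, yl)
    lh = mul16(xl, yh)
    hl = mul16(xh, yl)
    hh = mul16(xh, yh)

    # Middle column (bits 16..31 of the result): the sum is at most
    # 3 * 0xFFFF, so it fits comfortably in a 32-bit register.
    mid = (ll >> 16) + (lh & MASK16) + (hl & MASK16)

    low = ((mid << 16) & MASK32) | (ll & MASK16)
    # The carry out of the middle column lands in the high word.
    high = hh + (lh >> 16) + (hl >> 16) + (mid >> 16)
    return high, low
```

For example, mul32x32_64(0xFFFFFFFF, 0xFFFFFFFF) returns (0xFFFFFFFE, 0x00000001), which matches (2^32 - 1)^2 = 2^64 - 2^33 + 1.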