Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!littlei!intelhf!ichips!ichips!colwell
From: colwell@pdx023.pdx023 (Robert Colwell)
Newsgroups: comp.arch
Subject: Re: RISC vs. CISC -- SPECmarks
Message-ID:
Date: 3 May 91 10:18:02 GMT
References: <11412@mentor.cc.purdue.edu> <1991May2.171755.18612@riacs.edu>
Sender: news@ichips.intel.com (News Account)
Organization: Intel Corp., Hillsboro, Oregon
Lines: 59
In-Reply-To: lamaster@pioneer.arc.nasa.gov's message of 2 May 91 17:17:55 GMT

In article <1991May2.171755.18612@riacs.edu> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:

   In principle, a superscalar implementation with a really smart
   vectorizing compiler can do as well as a vector machine.  In order
   to do so, however, it would need to be able to issue two uncached
   loads and an uncached store every cycle, as well as a multiply and
   an add (this latter has been now on several machines).  Given the
   latency of non-cached memory, which is usually greater than 4
   cycles, and may even be up to 32 cycles, this would require the
   system to keep a large number of pending loads and stores, as well
   as a large number of registers (how many depends on what the
   latency of main memory is).  I believe that vector instructions
   would actually prove to be *much* easier to implement than a CPU
   with 20+ pending loads and stores, issuing five new instructions
   per CPU cycle...

You're talking micros here, right?  The machines we designed and built
at Multiflow (and occasionally SOLD) and Cydrome's machines did all
this routinely.  Designing the CPU per se has a whole lot of different
tradeoffs than more conventional machines, but then so do vector
machines.  I'd much rather design another VLIW than a vector machine;
much of the hard stuff gets shoved off to the compiler (in the RISC
style don't you know.)
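
[Editorial sketch, not part of the original post.]  The quoted
"20+ pending loads and stores" figure can be sanity-checked with
Little's law: operations in flight = issue rate x latency.  The sketch
below assumes the quoted issue rate of two loads plus one store per
cycle and the quoted latency bounds of 4 and 32 cycles; the 8-cycle
midpoint and the function name are illustrative assumptions, not
numbers from either poster.

```python
def in_flight_ops(mem_ops_per_cycle: int, latency_cycles: int) -> int:
    """Little's law: memory operations that must be pending at once
    to sustain the given issue rate at the given memory latency."""
    return mem_ops_per_cycle * latency_cycles


# Two uncached loads and one uncached store per cycle (from the quoted post).
MEM_OPS_PER_CYCLE = 2 + 1

# 4 and 32 are the quoted latency bounds; 8 is an assumed midpoint.
for latency in (4, 8, 32):
    pending = in_flight_ops(MEM_OPS_PER_CYCLE, latency)
    print(f"latency {latency:2d} cycles -> {pending:2d} pending memory ops")
```

Under these assumptions the count runs from 12 pending operations at
4 cycles to 96 at 32 cycles, so "20+" sits toward the low end of the
quoted latency range.
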

Maybe you're assuming one couldn't or wouldn't want to build such a
machine onto a single chip (or small set of chips; comp.arch doesn't
seem to uniformly distinguish between these two possibilities).  For
object code compatibility reasons alone you might not want to make the
SW/HW tradeoffs in exactly the same way as we did at MFCI.  But on the
other hand, in a world where folks have DOS emulators running on
RISCs, the rules are not always what they seem.

   |> The big problem is that the memory bandwidth required for vector FP is
   |> expensive and is not likely to contribute substantially to the non-FP
   |> performance.

   I agree that the memory subsystem is a major problem, but I am not
   sure that it is as bad as assumed above.

We all agree this is a major problem, then.  Supercomputers look sick
on SPEC benchmarks mostly because of this, as far as I can tell.
Those expensive boatloads of fast RAMs don't help much once you have
enough to feed the benchmarks, but they still burn power and make the
machine super-pricey.

   The difference here is that ideally you would want three or four
   load store units, and need to maintain cache coherence at the same
   time.

This is a big part of the problem.  You could consider not maintaining
cache coherence from all of the ports, but that is untried SW/HW
territory, and the SW folks will have to be resuscitated once you
suggest it to them.  (It would sure help the HW guys though.)

Hey, while we're at it, why don't we snoop a few buses too, and build
a multiprocessor out of this.  Somebody will want to, that's a given.
Good luck.  This is all possible, but the bad news is the obvious
news: it'll cost you design time, product cost, and performance.
Quite possibly more than it's worth.

Bob Colwell  colwell@ichips.intel.com  503-696-4550
Intel Corp.  JF1-19
5200 NE Elam Young Parkway
Hillsboro, Oregon 97124