Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!uwm.edu!bionet!agate!riacs!pioneer.arc.nasa.gov!lamaster
From: lamaster@pioneer.arc.nasa.gov (Hugh LaMaster)
Newsgroups: comp.arch
Subject: Re: RISC vs. CISC -- SPECmarks
Message-ID: <1991May2.171755.18612@riacs.edu>
Date: 2 May 91 17:17:55 GMT
References: <11412@mentor.cc.purdue.edu>
Sender: news@riacs.edu
Reply-To: lamaster@pioneer.arc.nasa.gov (Hugh LaMaster)
Organization: RIACS, NASA Ames Research Center
Lines: 51

In article , mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
|> >On 2 May 91 01:54:10 GMT, lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) said:
|>
|> Hugh> PROPHECY: One of these days, a single-chip microprocessor will
|> Hugh> have vector instructions, and then the advantages and
: :
|> I don't see much benefit to explicit vector instructions compared to
|> tight loops with zero cycle branches (like the RS/6000). They sure
|> can eat up a lot of silicon space, though....

In principle, a superscalar implementation with a really smart
vectorizing compiler can do as well as a vector machine. In order to
do so, however, it would need to be able to issue two uncached loads
and an uncached store every cycle, as well as a multiply and an add
(the latter has now been done on several machines). Given the latency
of non-cached memory, which is usually greater than 4 cycles, and may
even be up to 32 cycles, this would require the system to keep a large
number of pending loads and stores, as well as a large number of
registers (how many depends on what the latency of main memory is).
I believe that vector instructions would actually prove to be *much*
easier to implement than a CPU with 20+ pending loads and stores,
issuing five new instructions per CPU cycle...

|> The big problem is that the memory bandwidth required for vector FP is
|> expensive and is not likely to contribute substantially to the non-FP
|> performance.
|> Without adequate memory bandwidth, there is not really
|> any need for vector instructions, since the cpu is idle (waiting for
|> cache refills) for plenty of time to do loop control....

I agree that the memory subsystem is a major problem, but I am not
sure that it is as bad as assumed above. I agree that a vector ISA is
useless without the memory bandwidth to back it up. But what I
envision is this: vector loads/stores can very conveniently avoid
modifying cache (unless the location is already cached), and the
latencies on some cached systems are already fairly long, with fairly
high bandwidths during cache refills. Bandwidth by itself is not all
that expensive; what IS expensive is low-latency, high-bandwidth
memory systems. Given some of the cache refill strategies on current
machines, feeding a vector load/store unit would not be that big a
deal. The difference here is that ideally you would want three or
four load/store units, and need to maintain cache coherence at the
same time.

--
  Hugh LaMaster, M/S 233-9,     UUCP:      ames!lamaster
  NASA Ames Research Center     Internet:  lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035       With Good Mailer: lamaster@george.arc.nasa.gov
  Phone:  415/604-6117          #include
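
The pending-load argument in the reply can be checked with a rough
back-of-the-envelope calculation. The sketch below assumes a simple
Little's-law model (operations in flight = issue rate times latency),
which is my framing and not something the post states. The rate of
three memory operations per cycle (two loads and one store) and the
4- and 32-cycle latencies come from the post; the 8- and 16-cycle
values and the function name outstanding_memory_ops are only
illustrative.

```python
def outstanding_memory_ops(ops_per_cycle: int, latency_cycles: int) -> int:
    """Memory operations in flight needed to sustain the issue rate.

    Assumes every operation sees the same fixed latency and that the
    pipeline never stalls (a simplification, not a machine model).
    """
    return ops_per_cycle * latency_cycles


# Two uncached loads plus one uncached store per cycle, as in the post.
MEM_OPS_PER_CYCLE = 3

# 4 and 32 cycles are the bounds mentioned in the post; 8 and 16 are
# illustrative intermediate points.
for latency in (4, 8, 16, 32):
    in_flight = outstanding_memory_ops(MEM_OPS_PER_CYCLE, latency)
    print(f"latency {latency:2d} cycles -> {in_flight:3d} memory ops in flight")
```

Under this model a 4-cycle latency already needs 12 operations in
flight and a 32-cycle latency needs 96, so the "20+ pending loads and
stores" figure would correspond to a latency of about seven cycles.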