Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!uwm.edu!bionet!agate!riacs!pioneer.arc.nasa.gov!lamaster
From: lamaster@pioneer.arc.nasa.gov (Hugh LaMaster)
Newsgroups: comp.arch
Subject: Re: RISC vs. CISC -- SPECmarks
Message-ID: <1991May2.171755.18612@riacs.edu>
Date: 2 May 91 17:17:55 GMT
References: <TH_A6-F@xds13.ferranti.com> <11412@mentor.cc.purdue.edu> <MCCALPIN.91May2095930@pereland.cms.udel.edu>
Sender: news@riacs.edu
Reply-To: lamaster@pioneer.arc.nasa.gov (Hugh LaMaster)
Organization: RIACS, NASA Ames Research Center
Lines: 51

In article <MCCALPIN.91May2095930@pereland.cms.udel.edu>, mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
|> >On 2 May 91 01:54:10 GMT, lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) said:
|> 
|> Hugh> PROPHECY: One of these days, a single-chip microprocessor will
|> Hugh> have vector instructions, and then the advantages and
:
:
|> I don't see much benefit to explicit vector instructions compared to
|> tight loops with zero cycle branches (like the RS/6000).  They sure
|> can eat up a lot of silicon space, though....

In principle, a superscalar implementation with a really smart vectorizing
compiler can do as well as a vector machine.  To do so, however, it would
need to issue two uncached loads and an uncached store every cycle, as
well as a multiply and an add (the multiply/add pair has already been done
on several machines).  Given the latency of non-cached memory, which is
usually greater than 4 cycles and may be as high as 32 cycles, the system
would have to track a large number of pending loads and stores and
provide a large number of registers (how many depends on the latency of
main memory).  I believe that vector instructions would actually prove to
be *much* easier to implement than a CPU with 20+ pending loads and
stores, issuing five new instructions per CPU cycle...
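
A back-of-the-envelope sketch of the bookkeeping described above: the
number of memory operations in flight is roughly the issue rate times the
memory latency.  The function name and its defaults are illustrative
assumptions, not measurements of any particular machine, and the sketch
assumes stores stay pending for the same latency as loads.  The 4- and
32-cycle latencies and the two-loads-plus-one-store rate are the figures
from the paragraph above.

```python
def pending_memory_ops(latency_cycles, loads_per_cycle=2, stores_per_cycle=1):
    """Approximate number of loads/stores in flight at steady state.

    Assumes a fully pipelined memory system with a new set of memory
    operations issued every cycle, so in-flight ops = rate * latency.
    """
    return (loads_per_cycle + stores_per_cycle) * latency_cycles


if __name__ == "__main__":
    # Latencies in cycles; 4 and 32 bracket the range quoted in the post.
    for latency in (4, 8, 32):
        print(latency, "cycles ->", pending_memory_ops(latency), "pending ops")
```

Under these assumptions a latency of 7 or 8 cycles already gives 21 to 24
operations in flight, which matches the "20+" figure, and 32 cycles gives 96.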

 
|> The big problem is that the memory bandwidth required for vector FP is
|> expensive and is not likely to contribute substantially to the non-FP
|> performance.  Without adequate memory bandwidth, there is not really
|> any need for vector instructions, since the cpu is idle (waiting for
|> cache refills) for plenty of time to do loop control....

I agree that the memory subsystem is a major problem, and that a vector
ISA is useless without the memory bandwidth to back it up, but I am not
sure that it is as bad as assumed above.

What I envision is this: vector loads/stores can very conveniently avoid
modifying cache (unless the location is already cached), and the latencies
on some cached systems are already fairly long, with fairly high
bandwidths during cache refills.  Bandwidth by itself is not all that
expensive; what IS expensive is a low-latency, high-bandwidth memory
system.  Given some of the cache refill strategies on current machines,
feeding a vector load/store unit would not be that big a deal.  The
difference here is that ideally you would want three or four load/store
units, and would need to maintain cache coherence at the same time.
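
To illustrate the latency/bandwidth distinction, here is a minimal sketch
under a simple pipelined memory model (an assumption for illustration, not
a description of any real machine): a vector load pays the latency once
and then receives words at the sustained bandwidth, while a blocking
scalar loop pays the full latency on every element.  The function names
and the vector length of 64 are hypothetical.

```python
def vector_load_cycles(n_words, latency_cycles, words_per_cycle=1):
    """Cycles to stream n_words when requests are pipelined: the latency
    is paid once, then words arrive at the sustained bandwidth."""
    transfer = -(-n_words // words_per_cycle)  # ceiling division
    return latency_cycles + transfer


def blocking_load_cycles(n_words, latency_cycles):
    """Cycles when each load must complete before the next one issues."""
    return n_words * latency_cycles


if __name__ == "__main__":
    n, latency = 64, 32  # hypothetical vector length; worst-case latency
    print("pipelined:", vector_load_cycles(n, latency))
    print("blocking: ", blocking_load_cycles(n, latency))
```

With these numbers the pipelined stream takes 96 cycles against 2048 for
the blocking loop, so a long latency costs little once the requests are
pipelined.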


-- 
  Hugh LaMaster, M/S 233-9,  UUCP:                ames!lamaster
  NASA Ames Research Center  Internet:            lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    With Good Mailer:    lamaster@george.arc.nasa.gov 
  Phone:  415/604-6117                            #include <std.disclaimer>