Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!littlei!intelhf!ichips!ichips!colwell
From: colwell@pdx023.pdx023 (Robert Colwell)
Newsgroups: comp.arch
Subject: Re: RISC vs. CISC -- SPECmarks
Message-ID: <COLWELL.91May3111802@pdx023.pdx023>
Date: 3 May 91 10:18:02 GMT
References: <TH_A6-F@xds13.ferranti.com> <11412@mentor.cc.purdue.edu>
	<MCCALPIN.91May2095930@pereland.cms.udel.edu>
	<1991May2.171755.18612@riacs.edu>
Sender: news@ichips.intel.com (News Account)
Organization: Intel Corp., Hillsboro, Oregon
Lines: 59
In-Reply-To: lamaster@pioneer.arc.nasa.gov's message of 2 May 91 17:17:55 GMT

In article <1991May2.171755.18612@riacs.edu> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:

   In principle, a superscalar implementation with a really smart vectorizing
   compiler can do as well as a vector machine.  In order to do so, however,
   it would need to be able to issue two uncached loads and an uncached
   store every cycle, as well as a multiply and an add (this latter has
   now been done on several machines).  Given the latency of non-cached memory,
   which is usually greater than 4 cycles, and may even be up to 32 cycles,
   this would require the system to keep a large number of pending loads
   and stores, as well as a large number of registers (how many depends on
   what the latency of main memory is).  I believe that vector instructions
   would actually prove to be *much* easier to implement than a CPU with 20+
   pending loads and stores, issuing five new instructions per CPU cycle...
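
A back-of-envelope check on the figures quoted above, as a minimal sketch: by
Little's law, the number of memory operations in flight equals the issue rate
times the latency.  The function names, and the assumption that a store stays
pending for the full memory latency, are illustrative and do not come from
either poster.

```python
# Sketch only: occupancy = issue rate x latency (Little's law).
# Assumes every load AND every store stays pending for the full memory latency.

def in_flight(latency_cycles, loads_per_cycle=2, stores_per_cycle=1):
    """Memory operations that must be outstanding to sustain the issue rate."""
    return latency_cycles * (loads_per_cycle + stores_per_cycle)


def load_target_registers(latency_cycles, loads_per_cycle=2):
    """Registers tied up as destinations of loads that have not yet returned."""
    return latency_cycles * loads_per_cycle


if __name__ == "__main__":
    # 4 and 32 cycles are the latencies named in the quoted text; 8 is an
    # arbitrary midpoint chosen for illustration.
    for latency in (4, 8, 32):
        print(latency, in_flight(latency), load_target_registers(latency))
```

With two loads and one store per cycle, the "20+ pending loads and stores"
figure corresponds to a latency of about 7 cycles; at 32 cycles the same
issue rate needs 96 operations in flight.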

You're talking micros here, right?  The machines we designed and built at
Multiflow (and occasionally SOLD) and Cydrome's machines did all this routinely.
Designing the CPU per se involves a very different set of tradeoffs from more
conventional machines, but then so do vector machines.  I'd much rather design
another VLIW than a vector machine; much of the hard stuff gets shoved off to
the compiler (in the RISC style, don't you know).

Maybe you're assuming one couldn't or wouldn't want to build such a machine on
a single chip (or small set of chips; comp.arch doesn't seem to uniformly
distinguish between these two possibilities).  For object code compatibility
reasons alone you might not want to make the SW/HW tradeoffs in exactly the same
way as we did at MFCI.  But on the other hand, in a world where folks have DOS
emulators running on RISCs, the rules are not always what they seem.

   |> The big problem is that the memory bandwidth required for vector FP is
   |> expensive and is not likely to contribute substantially to the non-FP
   |> performance.

   I agree that the memory subsystem is a major problem, but I am not sure
   that it is as bad as assumed above.

We all agree this is a major problem, then.  Supercomputers look sick on SPEC
benchmarks mostly because of this, as far as I can tell.  Those expensive
boatloads of fast RAMs don't help much once you have enough to feed the
benchmarks, but they still burn power and make the machine super-pricey.

   The difference here is that ideally you would want
   three or four load/store units, and need to maintain cache coherence at
   the same time.

This is a big part of the problem.  You could consider not maintaining cache
coherence from all of the ports, but that is untried SW/HW territory, and the SW
folks will have to be resuscitated once you suggest it to them.  (It would sure
help the HW guys though.)  Hey, while we're at it, why don't we snoop a few
buses too, and build a multiprocessor out of this.  Somebody will want to,
that's a given.  

Good luck.  This is all possible, but the bad news is the obvious news: it'll
cost you design time, product cost, and performance.  Quite possibly more than
it's worth.

Bob Colwell  colwell@ichips.intel.com  503-696-4550
Intel Corp.  JF1-19
5200 NE Elam Young Parkway
Hillsboro, Oregon 97124