Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!sdd.hp.com!spool.mu.edu!uunet!stanford.edu!agate!riacs!pioneer.arc.nasa.gov!lamaster
From: lamaster@pioneer.arc.nasa.gov (Hugh LaMaster)
Newsgroups: comp.arch
Subject: Re: RISC vs. CISC -- SPECmarks
Message-ID: <1991May7.195913.27363@riacs.edu>
Date: 7 May 91 19:59:13 GMT
References: <1991Apr30.163153.18568@midway.uchicago.edu> <1991May2.162909.9165@news.arc.nasa.gov> <819@cadlab.sublink.ORG> <1991May7.052417.10606@leland.Stanford.EDU>
Sender: news@riacs.edu
Reply-To: lamaster@pioneer.arc.nasa.gov (Hugh LaMaster)
Organization: RIACS, NASA Ames Research Center
Lines: 65

In article <1991May7.052417.10606@leland.Stanford.EDU>, dhinds@elaine18.Stanford.EDU (David Hinds) writes:
|> In article <819@cadlab.sublink.ORG> martelli@cadlab.sublink.ORG (Alex Martelli) writes:
|> >lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
|> >	...
|> >:Even worse, code which was previously optimal for vector machines, and which
|> >:was OK on a wide variety of other machines, is now pessimal for these machines.


|> >Not really so new - I was optimizing codes for the cache in '87 for an IBM
|> >3090 with VF... ok, there ARE problems (the curve of leading dimension of
|> >array versus megaflops

 
|> You're still a long way off.  My *father* was optimizing Fortran matrix codes
|> to exploit the cache on the IBM 370/195, in the (guess?) mid-70's.  On that


Both posters make essentially the same point, and it is well taken.
Machines with caches (and other locality-friendly devices) have been around a *long*
time.  Even the 360/67 got a boost from code rearrangement, because of the DAT box
(Dynamic Address Translation == "TLB", sort of) overhead you incurred if you accessed
arrays the wrong way.  On the new RISCs, the effect is extremely strong.  Combined with
some of the vector-ish features of these machines, optimal codes can look like a hybrid
of the cache and vector techniques, which makes them rather non-intuitive.
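
To make the "hybrid" concrete, here is a minimal sketch in C of a cache-blocked
matrix multiply whose innermost loop is still a long unit-stride sweep, i.e.
the kind of loop a vectorizer or a pipelined RISC likes.  The names
matmul_naive, matmul_blocked and BLOCK are made up for illustration, the tile
size of 16 is an assumption that would have to be tuned per machine, and the
arrays are row-major as in C (a Fortran version would swap the index roles).

```c
#include <stddef.h>

#define BLOCK 16  /* hypothetical tile size; assumed to fit in cache */

/* Plain i-j-k product on row-major n x n arrays.  The inner loop reads b
   with stride n, which is the "wrong way" for a cache or TLB. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Blocked product: the outer three loops pick tiles so the working set is
   reused from cache (the "cache" half); the innermost loop runs over j with
   unit stride on both b and c (the "vector" half). */
void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = ii + BLOCK < n ? ii + BLOCK : n;
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t k_end = kk + BLOCK < n ? kk + BLOCK : n;
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t j_end = jj + BLOCK < n ? jj + BLOCK : n;
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];  /* held in a register */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
        }
    }
}
```

Both routines compute the same product; only the order of memory references
differs.  Whether the blocked form actually wins, and by how much, depends on
the cache size, line size and TLB reach of the machine in question, so treat
this as an illustration of the loop structure rather than a tuned kernel.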

I agree that this is nothing new.  The major problem for all computer architects, from
the beginning, has been where to put the bandwidth.  The new RISC-with-fast-cache
machines have properties somewhat like a minicomputer with an attached array processor.
If your problem is well suited to this, you can get phenomenal speedups very cheaply.  If
your problem does not have such locality, but is still vectorizable, a
"vector supercomputer" architecture may be a better approach.

What I am really arguing in favor of is a machine which combines both.  There is no
reason why you can't have a machine with both a superscalar CPU driven mainly off
cache, and a vector load-store architecture that can access secondary memory
directly.  Then, you get the best of both worlds.  The question is: when will this
be done on a microprocessor?

As for the sometimes-heard statement that "superscalar makes vector
obsolete": it *could*, just as a very fast Turing machine
could.  In order to actually *do it*, however, the load/store architecture
will have to be expanded considerably.  No one has yet succeeded in getting
that much concurrency going in a superscalar machine.  But, I wouldn't argue
that it couldn't be done.  In fact, I would like to see it.

As for the other criticism, that VLIW machines make vector obsolete,
I agree.  The Multiflow architecture could potentially have made "vector"
machines obsolete.  In fact, it is really too bad that they went out of
business.  Someone ought to be working on a single-chip VLIW, if they aren't
already.  But, I haven't heard of anyone.  In many ways, VLIW seems to be
a simpler and more general form of "vectorization".


|>  -David Hinds
|>   dhinds@cb-iris.stanford.edu

-- 
  Hugh LaMaster, M/S 233-9,  UUCP:                ames!lamaster
  NASA Ames Research Center  Internet:            lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    With Good Mailer:    lamaster@george.arc.nasa.gov 
  Phone:  415/604-6117                            #include <std.disclaimer>