Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!zaphod.mps.ohio-state.edu!uwm.edu!ux1.cso.uiuc.edu!midway!quads.uchicago.edu!rtp1
From: rtp1@quads.uchicago.edu (raymond thomas pierrehumbert)
Newsgroups: comp.arch
Subject: Vector vs Cache/Superscalar
Message-ID: <1991May4.031835.7979@midway.uchicago.edu>
Date: 4 May 91 03:18:35 GMT
Sender: news@midway.uchicago.edu (NewsMistress)
Organization: University of Chicago
Lines: 38

McAlpin comments that he finds vectorization (even on the Cyber 205)
simpler, more intuitive and more transportable than the optimization
techniques used on cached machines like the RS/6000.

I think this is partly because the vector model of parallelism is
so rigid; optimization for the superscalars involves a bigger bag
of tricks.  Still, I have found that there are fewer things the
superscalars choke on, and that it is easier to localize optimization
in a few reusable routines.  Two case studies:

(a)  I have some semi-spectral 2D fluid codes (finite difference in
one direction, spectral in the other) which I never got
around to optimizing on the Cyber, because it would have involved
some major structural changes.  On the other hand, on the RS/6000,
i860-based machines, and even my hated DN10000, 1D FFTs scream
right along at nearly the machine's top speed (lots of data re-use).
In this case, a simple plug-in of canned FFTs gave a major
speed-up.

(b)  Tridiagonal solving.  It comes up in lots of codes, and it is
a real vector-breaker.  In fact, vector machines choke on all
sorts of recursion (each step needing the result of the one before),
whereas the superscalars just love it.  On the RS/6000, the time
spent in the tridiag code basically vanished, whereas on the vector
Stardent, it was a bottleneck.
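
To make the recursion concrete, here is a minimal sketch of the
standard Thomas algorithm for a tridiagonal system.  It is an
illustration only, not the code from any of the machines above; the
function name and argument layout are my own assumptions, and it does
no pivoting, so it assumes a well-behaved (e.g. diagonally dominant)
matrix.  Note that both loops use the value computed on the previous
pass, which is exactly what a vector pipeline cannot overlap.

```python
def solve_tridiagonal(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm (sketch).

    a: sub-diagonal (a[0] is unused)
    b: main diagonal
    c: super-diagonal (c[n-1] should be 0.0)
    d: right-hand side
    Assumes no pivoting is needed (e.g. diagonally dominant matrix).
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]

    # Forward elimination: step i needs cp[i-1] and dp[i-1].
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Back substitution: step i needs x[i+1].
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

In a 2D code one usually solves many independent systems of this
kind, so on a vector machine the usual escape is to vectorize across
systems rather than along one, at the cost of restructuring the data.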

A third example that occurs to me is evaluation of transcendental
functions.  Lots of recursion, and pretty efficient on the RISCs.
On a vector machine, you have to keep iterating the whole vector until
the slowest-converging argument is done converging (unless you
do a lot of reshuffling in memory).
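
A rough sketch of that cost, under loud assumptions: the series
below is a plain Taylor sum for exp(x) with a hypothetical tolerance,
chosen only because its convergence depends on the argument.  Real
math libraries use range reduction and fixed polynomials instead, so
treat the counts as illustrative, not as a measurement of any machine.

```python
def exp_series(x, tol=1e-12):
    """Return (value, terms_used) for a naive Taylor sum of exp(x).

    Illustrative only: stops when the latest term drops below tol.
    """
    term = 1.0
    total = 1.0
    n = 0
    while abs(term) >= tol:
        n += 1
        term *= x / n  # each term is built from the previous one
        total += term
    return total, n


def scalar_work(xs):
    """Total iterations when each argument stops as soon as it converges."""
    return sum(exp_series(x)[1] for x in xs)


def lockstep_work(xs):
    """Total iterations when every element runs as long as the slowest one,
    which is the vector-machine situation described above."""
    slowest = max(exp_series(x)[1] for x in xs)
    return len(xs) * slowest
```

For arguments of mixed size, such as [0.01, 0.1, 1.0, 5.0],
lockstep_work comes out larger than scalar_work, because the small
arguments are carried along until the largest one has converged.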

Now, the $64 question:  Why is there no supercomputer built around
a processor architecture like the RS/6000's, BUT with
your extra $2M buying bandwidth to memory like the Cray's
(no cache)?  This would seem to be a real winner.  You could
simulate vectorization on it, but it would have all the
flexibility of the newer machines.