Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!rphroy!caen!uwm.edu!linac!midway!quads.uchicago.edu!rtp1
From: rtp1@quads.uchicago.edu (raymond thomas pierrehumbert)
Newsgroups: comp.arch
Subject: Re: Vector vs Cache/Superscalar
Message-ID: <1991May6.055943.6234@midway.uchicago.edu>
Date: 6 May 91 05:59:43 GMT
References: <1991May4.031835.7979@midway.uchicago.edu> <MCCALPIN.91May5075614@pereland.cms.udel.edu> <1991May6.035310.26794@marlin.jcu.edu.au>
Sender: news@midway.uchicago.edu (NewsMistress)
Organization: University of Chicago
Lines: 60


>recurrence, in full vector mode. Many old scalar problems can now be
>vectorized. I'd be interested to hear if Cray/ETA/Alliant etc have this
>capability too.

ETA is irrelevant, except for comp.folklore.computers, and I don't 
know about the current crop of Crays, but the new Alliants
(FX/800, FX/2800) are based on the i860 chip, and so they do
their "vectorization" by generating pipelined code for the
RISC chip.  The i860 has no problems with recursion, except
for possible delays in re-use of a result (during which you
usually want to store something else, or do another computation
not dependent on the result).

I wonder how the Fujitsu handles the recursion.  For long enough
vectors to make it worthwhile, there is a pretty well known hack
to make sum-reduction vectorizable (split the vector in two,
do a vector add, split again, etc.).  Works for dot products as
well, of course.  The RISC architectures can handle a much
more general kind of recursion.
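
A minimal sketch of the halving trick described above, in plain Python
for illustration only. The names halving_sum and halving_dot are
hypothetical, and the list comprehension stands in for what a vector
unit would do as a single vector add. Note that the reordering can
change floating-point rounding relative to a left-to-right sum.

```python
def halving_sum(values):
    """Sum a sequence by repeated halving.

    Each pass adds the upper half onto the lower half elementwise,
    then repeats on the shorter result until one element is left.
    """
    work = list(values)
    while len(work) > 1:
        half = len(work) // 2
        # With an odd length, the last element is carried along unchanged.
        carry = work[2 * half:]
        # Elementwise add of the two halves: no element depends on another,
        # so this step has no recurrence and could run as one vector add.
        work = [work[i] + work[half + i] for i in range(half)] + carry
    return work[0] if work else 0.0


def halving_dot(x, y):
    """Dot product: elementwise multiply, then the same halving reduction."""
    return halving_sum([a * b for a, b in zip(x, y)])
```

The vectors get shorter on every pass, which is why the trick only pays
off when the starting vector is long.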

Concerning McCalpin's comments on vectorizability and maintainability
of code, I'm not claiming that the codes I'm talking about are
intrinsically unvectorizable.  I'm just giving some examples from
one particular (and perhaps rather lazy, as far as coding goes)
real-world user.  My experience also reflects that of a lot of
postdocs and students that have worked for me.  We write a
lot of banged-out code, get interesting results, and somehow
never get around to doing much optimization until the optimization
becomes more interesting to do than looking at the scientific
results (at which point, we've basically moved on to something 
else anyway).  My own experience with the IBM RS/6000 architecture
is that we have gotten more payoff from a little optimization effort,
and that it is easier to plug in optimized ESSL routines.  

Tridiagonal solving is a case in point.  True, in fluids codes,
you can always vectorize in the direction(s) perpendicular to
the recursion.  On the other hand, this still means you have
to do a little thinking to write a special purpose tridiag 
solver suited to your particular storage configuration. Not
really all that hard, but I find it a lot easier to have
a super-optimized canned tridiag solver I don't even have
to recompile, and then simply do a bunch of calls to this
solver; this code structure lays the groundwork for 
parallelization as well. True, you do still have to think
about storage layout because of the performance hit for non-unit
strides on cache-based architectures (wish somebody would build
a cache that could pre-load more general patterns than contiguous
memory).
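
To make the code structure concrete, here is a rough sketch of "one
canned solver, many calls". It is not the ESSL routine: solve_tridiag
and solve_many are hypothetical names, the solver is the textbook
Thomas algorithm without pivoting, and it assumes a well-conditioned
(e.g. diagonally dominant) system. By convention here, lower[0] and
upper[n-1] are ignored.

```python
def solve_tridiag(lower, diag, upper, rhs):
    """Solve one tridiagonal system by forward elimination and
    back substitution. All four inputs have length n."""
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward sweep: each step depends on the previous one (the recurrence).
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution, also a recurrence, running the other way.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_many(lower, diag, upper, rhs_columns):
    """Solve the same tridiagonal operator against many right-hand sides.
    The calls are independent, so they could be farmed out in parallel."""
    return [solve_tridiag(lower, diag, upper, rhs) for rhs in rhs_columns]
```

Each right-hand side is passed as its own contiguous list, which is
the layout the stride remark above argues for.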

While I'm waxing nostalgic, some of the codes I have seen for
cache-based machines have seemed hauntingly familiar, and I 
have only recently realized why:  They hark back to the 70's,
when computers had too little RAM to hold your field, and
so doing fluid dynamics on computers meant loading stuff into
memory from disk, crunching the life out of it, and squirting
the result back to disk.  There was of course a huge premium
on re-use of data (so I'm told-- this was really before my
time, of course).