Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.arch
Subject: Re: Vector vs Cache/Superscalar
Message-ID: <MCCALPIN.91May5075614@pereland.cms.udel.edu>
Date: 5 May 91 11:56:14 GMT
References: <1991May4.031835.7979@midway.uchicago.edu>
Sender: usenet@ee.udel.edu
Organization: College of Marine Studies, U. Del.
Lines: 55
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: rtp1@quads.uchicago.edu's message of 4 May 91 03:18:35 GMT

> On 4 May 91 03:18:35 GMT, rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) said:

raymond> McAlpin comments that he finds vectorization (even on the
raymond> Cyber 205) simpler, more intuitive and more transportable
raymond> than the optimization techniques used on cached machines like
raymond> the RS/6000.

raymond> (a)  I have some semi-spectral 2D fluid codes (finite diff in
raymond> one direction, spectral in the other) which I never got
raymond> around to optimizing on the Cyber, because it would have involved
raymond> some major structural changes. 

I have had no trouble optimizing similar codes for the ETA-10.  Of
course, my codes were designed with long vectors in mind, and it
sounds like yours were not.  It also helps if the problems are large
enough.  On the ETA-10, one did not get 90% of full speed on FFT's
until one was calculating a minimum of several hundred of them
simultaneously.  I agree that it was often very difficult to
optimize codes written for other machines on the Cyber.  But I found
the long vector programming model so easy to use that I often re-wrote
the applications from scratch faster than I could have "optimized" the
original... 

raymond> (b) Tridiagonal solving.  Comes up in lots of codes, and it is
raymond> a real vector-breaker.  In fact, vector machines choke on all
raymond> sorts of recursion, whereas the superscalars just love them.
raymond> On the RS/6000, the tridiag code basically vanished, whereas on
raymond> the vector Stardent, it was a bottleneck.

In fluid dynamics codes, tridiagonal systems almost always arise in
groups (e.g., one system per row of the 2-D domain).  In these cases it
is trivial to vectorize sideways across the systems and get full
vector performance.  You still have a vector divide or two in there,
but as long as you are not working on an IBM 3090VF that should not be
too much of a problem. ;-)
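
To make "vectorizing sideways" concrete, here is a minimal sketch in C
(not the Fortran such models would actually use).  The routine name
tridiag_batch and the storage layout are assumptions made for
illustration, not code from any model mentioned here.  The recurrence
over the row index i stays serial, but the inner loop over the system
index s has no dependencies and unit stride, so its length is the
number of systems.  It uses the plain Thomas algorithm without
pivoting, so it assumes diagonally dominant systems.

```c
#include <stddef.h>

/*
 * Solve nsys independent tridiagonal systems of order n at once.
 * Hypothetical layout: element i of system s is stored at [i*nsys + s].
 *   a = sub-diagonal (row 0 unused), b = diagonal,
 *   c = super-diagonal (row n-1 unused; overwritten as scratch),
 *   d = right-hand side on entry, solution on exit.
 * No pivoting: assumes each system is diagonally dominant.
 */
void tridiag_batch(size_t n, size_t nsys,
                   const double *a, const double *b,
                   double *c, double *d)
{
    if (n == 0 || nsys == 0)
        return;

    /* Row 0 of every system: one vector divide. */
    for (size_t s = 0; s < nsys; s++) {
        c[s] = c[s] / b[s];
        d[s] = d[s] / b[s];
    }

    /* Forward elimination: serial in i, vector of length nsys in s. */
    for (size_t i = 1; i < n; i++) {
        size_t cur = i * nsys, prev = (i - 1) * nsys;
        for (size_t s = 0; s < nsys; s++) {
            double denom = b[cur + s] - a[cur + s] * c[prev + s];
            c[cur + s] = c[cur + s] / denom;
            d[cur + s] = (d[cur + s] - a[cur + s] * d[prev + s]) / denom;
        }
    }

    /* Back substitution: again serial in i, vector in s. */
    for (size_t i = n - 1; i-- > 0; ) {
        size_t cur = i * nsys, next = (i + 1) * nsys;
        for (size_t s = 0; s < nsys; s++)
            d[cur + s] -= c[cur + s] * d[next + s];
    }
}
```

With one system per row of a 2-D domain, nsys is the number of rows,
so the vector length grows with the problem size even though each
individual solve remains a recurrence.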

Once again, I agree that (for example) the RS/6000 is a wonderful
machine.  I am now completing the optimization of a 2-D finite
difference vorticity model which (I currently estimate) will run at a
sustained 15 MFLOPS (64-bit) on the RS/6000-320.  *But* the
optimizations which enabled me to do this are *much* uglier and lead
to *much* less maintainable code than the original long vector code.
If I did not anticipate a need for 200 days wall time to complete the
work with the original code, I would not have considered these
optimizations worthwhile.

(The optimizations that I use are: (1) use IBM ESSL routines for FFT's
and tridiagonal solves; (2) inner loop unrolling; (3) outer loop
unrolling followed by inner loop jamming/fusion; (4) subroutine
inlining followed by manual loop jamming followed by unrolling.)
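
As a rough illustration of item (3), here is a sketch in C of outer
loop unrolling followed by inner loop jamming.  The matrix-vector
product is a stand-in chosen for brevity and is an assumption, not a
kernel from the vorticity model; the function names are likewise
hypothetical.  The point is that after jamming, each x[j] is loaded
once and used for two rows, which is the kind of reuse a cached
machine rewards, at the cost of a cleanup loop and less readable code.

```c
#include <stddef.h>

/* Baseline: y = A*x for a row-major n-by-m matrix A. */
void matvec_plain(size_t n, size_t m,
                  const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < m; j++)
            sum += A[i * m + j] * x[j];
        y[i] = sum;
    }
}

/* Outer loop unrolled by 2, with the two inner loops jammed into one. */
void matvec_unroll_jam(size_t n, size_t m,
                       const double *A, const double *x, double *y)
{
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        double s0 = 0.0, s1 = 0.0;
        for (size_t j = 0; j < m; j++) {
            double xj = x[j];               /* loaded once, used twice */
            s0 += A[i * m + j] * xj;
            s1 += A[(i + 1) * m + j] * xj;
        }
        y[i] = s0;
        y[i + 1] = s1;
    }

    /* Cleanup for the leftover row when n is odd. */
    for (; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < m; j++)
            sum += A[i * m + j] * x[j];
        y[i] = sum;
    }
}
```

Each row's sum is accumulated in the same order in both versions, so
the results should agree exactly; only the memory traffic and the
readability change.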
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET