Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.arch
Subject: Re: Vector vs Cache/Superscalar
Message-ID:
Date: 5 May 91 11:56:14 GMT
References: <1991May4.031835.7979@midway.uchicago.edu>
Sender: usenet@ee.udel.edu
Organization: College of Marine Studies, U. Del.
Lines: 55
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: rtp1@quads.uchicago.edu's message of 4 May 91 03:18:35 GMT

> On 4 May 91 03:18:35 GMT, rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) said:

raymond> McAlpin comments that he finds vectorization (even on the
raymond> Cyber 205) simpler, more intuitive and more transportable
raymond> than the optimization techniques used on cached machines like
raymond> the RS/6000.

raymond> (a) I have some semi-spectral 2D fluid codes (finite diff in
raymond> one direction, spectral in the other) which I never got
raymond> around to optimizing on the Cyber, because it would have involved
raymond> some major structural changes.

I have had no trouble optimizing similar codes for the ETA-10.  Of
course, my codes were designed with long vectors in mind, and it
sounds like yours were not.  It also helps if the problems are large
enough.  On the ETA-10, one did not reach 90% of full speed on FFTs
until one was calculating at least several hundred of them
simultaneously.

I agree that it was often very difficult to optimize codes written
for other machines on the Cyber.  But I found the long vector
programming model so easy to use that I often re-wrote the
applications from scratch faster than I could have "optimized" the
original...

raymond> (b) Tridiagonal solving. Comes up in lots of codes, and it is
raymond> a real vector-breaker. In fact, vector machines choke on all
raymond> sorts of recursion, whereas the superscalars just love them.
raymond> On the RS/6000, the tridiag code basically vanished, whereas on
raymond> the vector Stardent, it was a bottleneck.

In fluid dynamics codes, tridiagonal systems almost always arise in
groups (e.g. one system per row of the 2-D domain).  In these cases it
is trivial to vectorize sideways across the systems (each elimination
step then operates on a vector whose length is the number of systems)
and get full vector performance.  You still have a vector divide or
two in there, but as long as you are not working on an IBM 3090VF
that should not be too much of a problem.  ;-)

Once again, I agree that (for example) the RS/6000 is a wonderful
machine.  I am now completing the optimization of a 2-D finite
difference vorticity model which (I currently estimate) will run at a
sustained 15 MFLOPS (64-bit) on the RS/6000-320.

*But* the optimizations which enabled me to do this are *much* uglier
and lead to *much* less maintainable code than the original long
vector code.  If I did not anticipate a need for 200 days wall time to
complete the work with the original code, I would not have considered
these optimizations worthwhile.

(The optimizations that I use are:
   (1) use IBM ESSL routines for FFTs and tridiagonal solves;
   (2) inner loop unrolling;
   (3) outer loop unrolling followed by inner loop jamming/fusion;
   (4) subroutine inlining followed by manual loop jamming followed
       by unrolling.)
--
John D. McCalpin                        mccalpin@perelandra.cms.udel.edu
Assistant Professor                     mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.      J.MCCALPIN/OMNET
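
To illustrate the point about vectorizing "sideways" across a group of tridiagonal systems, here is a minimal sketch. It is written in Python purely for readability and is not the code discussed in the post. The function name `solve_tridiagonal_batch` and the data layout (element `[i][j]` is equation `i` of system `j`) are assumptions made for this example. It uses the standard Thomas algorithm without pivoting, so it assumes well-conditioned (e.g. diagonally dominant) systems. The recurrence runs over `i` in the outer loops, while every inner loop over `j` touches independent systems, which is the loop a vectorizing compiler could run at full vector length.

```python
def solve_tridiagonal_batch(lower, diag, upper, rhs):
    """Solve m independent tridiagonal systems of size n (Thomas algorithm).

    Each argument is a list of n rows with m entries per row; entry [i][j]
    belongs to equation i of system j.  lower[0] and upper[n-1] are unused
    coefficients and may hold any value.  No pivoting is performed.
    """
    n = len(diag)
    m = len(diag[0])
    cp = [[0.0] * m for _ in range(n)]  # modified super-diagonal
    dp = [[0.0] * m for _ in range(n)]  # modified right-hand side

    # Forward elimination.  The dependence is on row i-1 only, so the
    # inner loop over systems j has no recurrence and is vectorizable.
    for j in range(m):
        cp[0][j] = upper[0][j] / diag[0][j]
        dp[0][j] = rhs[0][j] / diag[0][j]
    for i in range(1, n):
        for j in range(m):
            denom = diag[i][j] - lower[i][j] * cp[i - 1][j]  # the vector divide
            cp[i][j] = upper[i][j] / denom
            dp[i][j] = (rhs[i][j] - lower[i][j] * dp[i - 1][j]) / denom

    # Back substitution, again with the recurrence in i and independent
    # work across j.
    x = [[0.0] * m for _ in range(n)]
    x[n - 1] = dp[n - 1][:]
    for i in range(n - 2, -1, -1):
        for j in range(m):
            x[i][j] = dp[i][j] - cp[i][j] * x[i + 1][j]
    return x
```

Calling the function with one column per row of the 2-D domain solves all of the systems in a single pass over `i`. Swapping the loop order (systems outermost) gives the same answers but leaves the recurrence in the inner loop, which is the form that breaks vectorization.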