Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!zaphod.mps.ohio-state.edu!uwm.edu!ux1.cso.uiuc.edu!midway!quads.uchicago.edu!rtp1
From: rtp1@quads.uchicago.edu (raymond thomas pierrehumbert)
Newsgroups: comp.arch
Subject: Vector vs Cache/Superscalar
Message-ID: <1991May4.031835.7979@midway.uchicago.edu>
Date: 4 May 91 03:18:35 GMT
Sender: news@midway.uchicago.edu (NewsMistress)
Organization: University of Chicago
Lines: 38

McAlpin comments that he finds vectorization (even on the Cyber 205)
simpler, more intuitive and more transportable than the optimization
techniques used on cached machines like the RS/6000. I think this is
partly because the vector model of parallelism is so rigid; optimization
for the superscalars involves a bigger bag of tricks. Still, I have
found that there are fewer things the superscalars choke on, and that it
is easier to localize optimization in a few reusable routines.
Two case studies:

(a) I have some semi-spectral 2D fluid codes (finite difference in one
direction, spectral in the other) which I never got around to optimizing
on the Cyber, because it would have involved some major structural
changes. On the other hand, on the RS/6000, i860-based machines, and
even my hated DN10000, 1D FFTs scream right along at nearly the
machine's top speed (lots of data re-use). In this case, a simple
plug-in of canned FFTs gave a major speed-up.

(b) Tridiagonal solving. Comes up in lots of codes, and it is a real
vector-breaker. In fact, vector machines choke on all sorts of
recursion, whereas the superscalars just love it. On the RS/6000, the
time spent in the tridiag code basically vanished, whereas on the vector
Stardent, it was a bottleneck.

A third example that occurs to me is evaluation of transcendental
functions. Lots of recursion, and pretty efficient on the RISCs.
On a vector machine, you have to keep iterating the vector until the
slowest-converging argument is done converging (unless you do a lot of
reshuffling in memory).

Now, the $64 question: Why no supercomputer based on an architecture
for the processor like the RS/6000, BUT with your extra $2M buying
bandwidth to memory like the Cray's (no cache)? This would seem to be
a real winner. You could simulate vectorization on it, but it would
have all the flexibility of the newer machines.
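
To make the "recursion" in case (b) concrete, here is a minimal sketch of a tridiagonal solve using the standard Thomas algorithm. It is not taken from any of the codes mentioned above; the function name and argument layout are assumptions made for illustration, and there is no pivoting, so it assumes a well-conditioned (e.g. diagonally dominant) system. The point is that each pass of both loops needs the result of the previous pass, which is the kind of dependence a vector unit cannot pipeline across elements.

```python
def solve_tridiagonal(a, b, c, d):
    """Solve a tridiagonal system by the Thomas algorithm (no pivoting).

    a: sub-diagonal   (a[0] is unused)
    b: main diagonal
    c: super-diagonal (c[-1] is unused)
    d: right-hand side
    All four are sequences of the same length n.
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]

    # Forward elimination: step i depends on step i - 1 (a recurrence).
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Back substitution: step i depends on step i + 1 (another recurrence).
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

When many independent tridiagonal systems have to be solved, one can run the loop across systems instead of along one, but that is the sort of structural change to the surrounding code referred to in case (a).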
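
The point about the slowest-converging argument can be sketched the same way. The example below is an illustration under stated assumptions, not how any particular math library works: it uses Newton's iteration for a square root as a stand-in for an iteratively evaluated function, and the names newton_sqrt and iteration_costs are hypothetical. It compares the number of update steps when each argument stops as soon as it has converged (the scalar case) with the number when every element is swept until the slowest one finishes (the vector case, without reshuffling in memory).

```python
def newton_sqrt(x, tol=1e-12):
    """Newton's iteration for sqrt(x), x > 0. Returns (root, iterations)."""
    y = x if x > 1.0 else 1.0  # start at or above the root
    steps = 0
    while abs(y * y - x) > tol * x:
        y = 0.5 * (y + x / y)
        steps += 1
    return y, steps


def iteration_costs(args):
    """Return (scalar_work, vector_work) in element-update steps.

    scalar_work: each argument iterates only until it has converged.
    vector_work: every argument is iterated for as many sweeps as the
                 slowest-converging argument needs.
    """
    counts = [newton_sqrt(x)[1] for x in args]
    scalar_work = sum(counts)
    vector_work = max(counts) * len(args)
    return scalar_work, vector_work
```

Calling iteration_costs on arguments that span several orders of magnitude shows the gap; when all arguments need the same number of steps, the two counts are equal.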