Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!rphroy!caen!uwm.edu!linac!midway!quads.uchicago.edu!rtp1
From: rtp1@quads.uchicago.edu (raymond thomas pierrehumbert)
Newsgroups: comp.arch
Subject: Re: Vector vs Cache/Superscalar
Message-ID: <1991May6.055943.6234@midway.uchicago.edu>
Date: 6 May 91 05:59:43 GMT
References: <1991May4.031835.7979@midway.uchicago.edu> <1991May6.035310.26794@marlin.jcu.edu.au>
Sender: news@midway.uchicago.edu (NewsMistress)
Organization: University of Chicago
Lines: 60

>recurrence, in full vector mode. Many old scalar problems can now be
>vectorized. I'd be interested to hear if Cray/ETA/Alliant etc have this
>capability too.

ETA is irrelevant, except for comp.folklore.computers, and I don't know about the current crop of Crays, but the new Alliants (FX/800, FX/2800) are based on the i860 chip, and so they do their "vectorization" by generating pipelined code for the RISC chip. The i860 has no problems with recursion, except for possible delays in re-use of a result (during which you usually want to store something else, or do another computation not dependent on the result).

I wonder how the Fujitsu handles the recursion. For long enough vectors to make it worthwhile, there is a pretty well known hack to make sum-reduction vectorizable (split the vector in two, do a vector add, split again, etc.). Works for dot products as well, of course. The RISC architectures can handle a much more general kind of recursion.

Concerning McAlpin's comments on vectorizability and maintainability of code, I'm not claiming that the codes I'm talking about are intrinsically unvectorizable. I'm just giving some examples from one particular (and perhaps rather lazy, as far as coding goes) real-world user. My experience also reflects that of a lot of postdocs and students that have worked for me.
We write a lot of banged-out code, get interesting results, and somehow never get around to doing much optimization until the optimization becomes more interesting to do than looking at the scientific results (at which point, we've basically moved on to something else anyway).

My own experience with the IBM RS/6000 architecture is that we have gotten more payoff from a little optimization effort, and that it is easier to plug in optimized ESSL routines. Tridiagonal solving is a case in point. True, in fluids codes, you can always vectorize in the direction(s) perpendicular to the recursion. On the other hand, this still means you have to do a little thinking to write a special-purpose tridiag solver suited to your particular storage configuration. Not really all that hard, but I find it a lot easier to have a super-optimized canned tridiag solver I don't even have to recompile, and then simply do a bunch of calls to this solver; this code structure lays the groundwork for parallelization as well. True, you do still have to think about storage layout because of the performance hit for non-unit strides on cache-based architectures (wish somebody would build a cache that could pre-load more general patterns than contiguous memory).

While I'm waxing nostalgic, some of the codes I have seen for cache-based machines have seemed hauntingly familiar, and I have only recently realized why: they hark back to the 70's, when computers had too little RAM to hold your field, and so doing fluid dynamics on computers meant loading stuff into memory from disk, crunching the life out of it, and squirting the result back to disk. There was of course a huge premium on re-use of data (so I'm told -- this was really before my time, of course).
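
For what it's worth, the "vectorize perpendicular to the recursion" point about tridiagonal solves can be sketched roughly as follows. This is only an illustration in plain Python, not an ESSL routine and not anybody's production solver: the function name solve_tridiag_batch and the storage layout (row index i along the recurrence, column index j over independent systems) are assumptions made for the example. It uses the standard Thomas algorithm without pivoting, so it assumes each system is diagonally dominant.

```python
def solve_tridiag_batch(a, b, c, d):
    """Solve m independent tridiagonal systems of size n in one sweep.

    Hypothetical layout: a, b, c, d are lists of n rows, each row holding
    m values (one per system). a = sub-diagonal (a[0] ignored),
    b = diagonal, c = super-diagonal (c[n-1] ignored), d = right-hand side.
    Returns x in the same n-by-m layout. No pivoting is done.
    """
    n = len(b)
    m = len(b[0])
    cp = [[0.0] * m for _ in range(n)]  # modified super-diagonal
    dp = [[0.0] * m for _ in range(n)]  # modified right-hand side
    x = [[0.0] * m for _ in range(n)]

    # First row of the forward sweep.
    for j in range(m):
        cp[0][j] = c[0][j] / b[0][j]
        dp[0][j] = d[0][j] / b[0][j]

    # Forward elimination. The i loop is the recurrence and is serial;
    # the inner j loop has no dependence between iterations, so it is
    # the direction a vector unit (or pipelined RISC code) can exploit.
    for i in range(1, n):
        for j in range(m):
            denom = b[i][j] - a[i][j] * cp[i - 1][j]
            cp[i][j] = c[i][j] / denom
            dp[i][j] = (d[i][j] - a[i][j] * dp[i - 1][j]) / denom

    # Back substitution, again serial in i and independent in j.
    for j in range(m):
        x[n - 1][j] = dp[n - 1][j]
    for i in range(n - 2, -1, -1):
        for j in range(m):
            x[i][j] = dp[i][j] - cp[i][j] * x[i + 1][j]

    return x
```

With this layout the inner j loop runs over the last index, which would be the unit-stride direction in a C-ordered array (in Fortran you would make j the first index instead). The alternative described above, calling a canned single-system solver once per j, trades that inner loop for an outer loop of independent calls, which is also the natural place to parallelize.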