Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!zaphod.mps.ohio-state.edu!swrinde!cs.utexas.edu!rice!ariel.rice.edu!preston
From: preston@ariel.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: what if the i860 had vector registers... ?
Message-ID: <1991May29.112250.22178@rice.edu>
Date: 29 May 91 11:22:50 GMT
References: <1991May24.035756.5161@murdoch.acc.Virginia.EDU> <13990@exodus.Eng.Sun.COM> <1991May28.190220.22150@murdoch.acc.Virginia.EDU>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
Lines: 66

gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:

>This is what kills the iPSC i860 box -- my code is guaranteed to run
>at the rate of the memory system, with no useful caching effects.

This problem kills most everything except vector machines (a la Cray).
That's why BLAS-2 and BLAS-3 routines were invented (i.e., to provide
higher-level operations with enough data reuse that the memory system
can keep up with the FP unit).

Try saxpy's on an HP or IBM when the data is not in cache.
They aren't going to get close to peak performance either.

--------------------

>For example:
>
>do i = 1, 128
>   z(i) = a * x(i) + y(i)
>enddo

>With the current scheme you unroll by 4, load 4 x's, load 4 y's taking
>a page-miss penalty, while calculating the results, and then write the
>results back taking another page miss. That's 3 memory system page
>misses per cycle.
            ^^^^^ iteration

Generally, you won't be able to use the load/store quad instructions
because it's difficult to be assured of alignment. It's better
to handle the above case by copying x into an aligned chunk of memory
(called VR1, say) using pipelined loads (4 in a row) and store-quads.
Then copy y into VR2. Finally, calculate the sum into VR3 (using
quad-loads and stores). Then copy VR3 back to z using quad-loads and
singleton stores. VR3 might be identical with either VR1 or VR2.

Generally, the VRs will end up in cache.
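
A rough sketch of the copy-in / compute / copy-out scheme described above,
in C rather than Fortran or i860 assembly. Everything named here is made up
for illustration: VLEN, vload, vaxpy, vstore and saxpy_strips are
hypothetical, the 16-byte alignment is an assumption about what the quad
instructions want, and the plain loops merely stand in for the pipelined
loads and the quad loads/stores, which C can't express directly.

```c
#include <stddef.h>

#define VLEN 128   /* hypothetical strip length; keep it small enough to stay in cache */

/* Aligned scratch buffers standing in for the "vector registers".
   VR3 is taken to be identical with VR1, as the text allows. */
static _Alignas(16) float VR1[VLEN];
static _Alignas(16) float VR2[VLEN];

/* Copy n possibly unaligned elements into an aligned VR
   (on the i860: pipelined loads, then store-quads). */
static void vload(float *vr, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        vr[i] = src[i];
}

/* Compute out = a * vx + vy entirely on aligned VRs
   (on the i860: quad-loads and quad-stores). */
static void vaxpy(float *out, float a, const float *vx, const float *vy, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a * vx[i] + vy[i];
}

/* Copy an aligned VR back to possibly unaligned memory
   (on the i860: quad-loads, then singleton stores). */
static void vstore(float *dst, const float *vr, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = vr[i];
}

/* z(i) = a * x(i) + y(i), processed one strip of at most VLEN elements at a time. */
void saxpy_strips(size_t n, float a, const float *x, const float *y, float *z)
{
    for (size_t i = 0; i < n; i += VLEN) {
        size_t len = (n - i < VLEN) ? (n - i) : VLEN;
        vload(VR1, x + i, len);        /* x -> VR1 */
        vload(VR2, y + i, len);        /* y -> VR2 */
        vaxpy(VR1, a, VR1, VR2, len);  /* result overwrites VR1 */
        vstore(z + i, VR1, len);       /* VR1 -> z */
    }
}
```

The point of the split is that only vload and vstore ever touch unaligned
user data; the arithmetic always runs on aligned buffers that should already
be sitting in cache.
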
Naturally, we could use subroutines for the copying between
memory and cache/VRs and for the vector add.

------------

Of course, Intel thought of all this. They do it, using a traditional
vectorizing front-end. The problem is that there's still insufficient
memory bandwidth to support vector operations at 80 MFlops.
That kind of performance requires hefty $$s.

------------

Note also that the "page misses" mentioned above and in Moyer's TR
are on a particular board, built around 8 MBytes of dynamic RAM.
Other boards have been built with static RAM ($$'s) that don't
have the problem of crossing page boundaries.

-------------

I think the difficulties with the i860 lie more in generating
code for the exposed pipelines and handling their unhelpful
multiply-add instruction. The bandwidth questions are going to afflict
all the micros for a while.

Preston Briggs