Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!udel!haven.umd.edu!uvaarpa!murdoch!astsun.astro.Virginia.EDU!gl8f
From: gl8f@astsun.astro.Virginia.EDU (Greg Lindahl)
Newsgroups: comp.arch
Subject: Re: what if the i860 had vector registers... ?
Message-ID: <1991May28.190220.22150@murdoch.acc.Virginia.EDU>
Date: 28 May 91 19:02:20 GMT
References: <1991May24.035756.5161@murdoch.acc.Virginia.EDU> <13990@exodus.Eng.Sun.COM>
Sender: usenet@murdoch.acc.Virginia.EDU
Organization: Department of Astronomy, University of Virginia
Lines: 66

In article <13990@exodus.Eng.Sun.COM> chased@rbbb.Eng.Sun.COM (David Chase) writes:

>>So, what if the i860 instead had a small data cache, and pipelined
>>loads into vector registers? It is relatively easy to design a
>>compiler that can do a good job with vector registers. And you would
>>be able to do pipelined loads of data to be re-used. And you wouldn't
>>end up starved for registers.
>
>Can you be more explicit about what you mean by vector registers?

I was thinking of something like Cray vector registers, or just an
explicitly-managed cache, with pipelined loads into cache instead of
the current non-pipelined loads into cache. The ALU would then be
cache-to-cache instead of register-to-register.

>One constraint on elementary row operations on the i860 (implemented
>in an elementary way -- i.e., without blocking the algorithm within
>which the EROs are used) is that the FP unit is capable of
>consuming/producing 64 bits of off-chip (single precision) operands
>per cycle.

This is what kills the iPSC i860 box -- my code is guaranteed to run
at the rate of the memory system, with no useful caching effects. I
read a lot of data. And the compilers can't even get up to memory
speed: scheduling the loads and stores well is a complex problem, and
they run out of registers.

>Note -- max pipelined load width is 64 bits, max cached load width is
>128 bits. An N-unrolled loop to do the elementary row op has N
>pipelined loads and stores, plus N/4 cached loads, plus one branch
>instruction. In double op mode, you pair this with N mpy-and-adds and
>N/4 + 1 fnops. Thus, in N + N/4 + 1 cycles you perform N mpy-and-adds
>at best, attaining a speed of N/(N + N/4 + 1) times C -- for N == 16,
>this is .76C.

This analysis is a bit theoretical in my case, as the iPSC memory
system provides 2-cycle loads at best (with maximum pipelining) as
long as you stay on the same page, with a big penalty when you go to a
different page. Even when you block loads until you run out of
registers, you're paying a nasty penalty and complicating your
compiler. For example:

      do i = 1, 128
         z(i) = a * x(i) + y(i)
      enddo

With vector registers you can do a simple pipelined load of all of x
and y while calculating the results, and then write the results back.
That's 3 page-mode-DRAM miss penalties total, and a solved compiler
problem.

With the current scheme you unroll by 4, load 4 x's, load 4 y's
(taking a page-miss penalty) while calculating the results, and then
write the results back, taking another page miss. That's 3 memory
system page misses per trip through the unrolled loop. (Note that I'm
not talking about virtual memory page faults; for details on the
memory system don't trust my memory, read the techreport.)

When you go to more complex expressions I think vector registers win
bigger and bigger, as long as you're looking at vector-type
expressions. For scalar expressions, I think the two methods would
provide about the same speed.

So, to repeat: did Intel simulate this kind of alternative design, and
would it provide better performance with a simpler compiler?
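
As a rough check on the arithmetic above, here is a small Python
sketch. It is not from Intel or the techreport. The function names and
the page-miss model (each array on its own DRAM page, each array
fitting within one page, one miss per switch between arrays) are
assumptions made up for illustration.

```python
from fractions import Fraction


def dual_op_efficiency(n):
    """Fraction of peak rate C for an n-unrolled row op, using the
    quoted count: n mpy-and-adds issued in n + n/4 + 1 cycles."""
    if n <= 0 or n % 4:
        raise ValueError("n must be a positive multiple of 4")
    return Fraction(n, n + n // 4 + 1)


def page_misses(length, unroll=None, streams=3):
    """Page-mode DRAM misses for z = a*x + y under the ASSUMED model:
    every switch between the `streams` arrays (x, y, z) costs one miss.
    unroll=None models loading whole vectors into vector registers."""
    if unroll is None:
        return streams                      # one miss per array, total
    trips = -(-length // unroll)            # ceiling division
    return streams * trips                  # misses on every trip


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, float(dual_op_efficiency(n)))
    print(page_misses(128), page_misses(128, unroll=4))
```

Under these assumptions N == 16 gives 16/21, about .76 of C, matching
the quoted figure, and the bound creeps toward .8 as N grows. The
128-element loop costs 3 page misses with whole-vector loads against
96 when unrolled by 4.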