Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!udel!haven.umd.edu!uvaarpa!murdoch!astsun.astro.Virginia.EDU!gl8f
From: gl8f@astsun.astro.Virginia.EDU (Greg Lindahl)
Newsgroups: comp.arch
Subject: Re: what if the i860 had vector registers... ?
Message-ID: <1991May28.190220.22150@murdoch.acc.Virginia.EDU>
Date: 28 May 91 19:02:20 GMT
References: <1991May24.035756.5161@murdoch.acc.Virginia.EDU> <13990@exodus.Eng.Sun.COM>
Sender: usenet@murdoch.acc.Virginia.EDU
Organization: Department of Astronomy, University of Virginia
Lines: 66

In article <13990@exodus.Eng.Sun.COM> chased@rbbb.Eng.Sun.COM (David Chase) writes:

>>So, what if the i860 instead had a small data cache, and pipelined
>>loads into vector registers? It is relatively easy to design a
>>compiler that can do a good job with vector registers. And you would
>>be able to do pipelined loads of data to be re-used. And you wouldn't
>>end up starved for registers.
>
>Can you be more explicit about what you mean by vector registers?

I was thinking of something like Cray vector registers, or just
explicitly-managed cache, with pipelined loads into cache instead of
the current non-pipelined loads into cache. Then the ALU would operate
cache-to-cache instead of register-to-register.

>One constraint on elementary row operations on the i860 (implemented
>in an elementary way -- i.e., without blocking the algorithm within
>which the EROs are used) is that the FP unit is capable of
>consuming/producing 64 bits of off-chip (single precision) operands
>per cycle.

This is what kills the iPSC i860 box -- my code is guaranteed to run
at the rate of the memory system, with no useful caching effects. I
read a lot of data. And the compilers can't even get up to memory
speed: scheduling the loads and stores well is a complex problem, and
they run out of registers.

>Note -- max pipelined load width is 64 bits, max cached load width is
>128 bits.  An N-unrolled loop to do the elementary row op has N
>pipelined loads and stores, plus N/4 cached loads, plus one branch
>instruction.  In double op mode, you pair this with N mpy-and-adds and
>N/4 + 1 fnops.  Thus, in N + N/4 + 1 cycles you perform N mpy-and-adds
>at best, attaining a speed of N/(N + N/4 + 1) times C -- for N == 16,
>this is .76C.
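
A quick sketch of the arithmetic in that count, for anyone who wants
to try other unroll factors. It only encodes the quoted cycle count
(N cycles of pipelined loads and stores, N/4 cached loads, one branch);
the function names are mine, and N is assumed to be a multiple of 4.

```python
def ero_cycles(n):
    """Cycles per trip of an n-unrolled loop, using the quoted count:
    n for pipelined loads/stores, n/4 for cached loads, 1 for the branch."""
    if n <= 0 or n % 4 != 0:
        raise ValueError("sketch assumes n is a positive multiple of 4")
    return n + n // 4 + 1


def ero_efficiency(n):
    """Fraction of peak rate C: n mpy-and-adds issued in ero_cycles(n) cycles."""
    return n / ero_cycles(n)


if __name__ == "__main__":
    for n in (4, 8, 16, 32, 64):
        print(n, ero_cycles(n), round(ero_efficiency(n), 3))
```

Under this count the ratio is 16/21 for N == 16 and approaches 0.8 as
N grows, so deeper unrolling alone cannot get past .8C.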

This analysis is a bit theoretical in my case, as the iPSC memory
system provides (max pipelining) 2 cycle loads if you're on the same
page, with a big penalty when you go to a different page. Even when
you block loads until you run out of registers, you're paying a nasty
penalty and complicating your compiler.

For example:

do i = 1, 128
   z(i) = a * x(i) + y(i)
enddo

With vector registers you can do a simple pipelined load of all of x
and y while calculating the results, and then write the results back.
That is 3 page-mode DRAM miss penalties total, and a solved compiler problem.

With the current scheme you unroll by 4, load 4 x's, load 4 y's (each
load group taking a page-miss penalty) while calculating the results,
and then write the results back, taking another page miss. That's 3
memory-system page misses per trip through the unrolled loop, and the
loop makes 32 trips. (Note that I'm not talking about virtual memory
page faults; for details on the memory system don't trust my memory,
read the tech report.)
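
To make the comparison concrete, here is a toy count of page misses
for the two schedules. It is a sketch under assumptions that are mine,
not from the tech report: the DRAM keeps exactly one page open, x, y
and z each sit entirely within their own page, and every switch to a
different page costs one miss. The function names are made up for the
illustration.

```python
def count_page_misses(accesses):
    """Count page misses for a sequence of accesses, each named by the
    array (and so the page) it touches. Assumes one open page at a time."""
    misses = 0
    open_page = None
    for page in accesses:
        if page != open_page:
            misses += 1          # switching pages costs one miss
            open_page = page
    return misses


def vector_schedule(n):
    """Vector-register style: stream all of x, then all of y, then store z."""
    return ["x"] * n + ["y"] * n + ["z"] * n


def unrolled_schedule(n, unroll=4):
    """Scalar-register style: per trip, load `unroll` x's, `unroll` y's,
    then store `unroll` z's."""
    accesses = []
    for _ in range(n // unroll):
        accesses += ["x"] * unroll + ["y"] * unroll + ["z"] * unroll
    return accesses


if __name__ == "__main__":
    n = 128  # trip count of the example loop
    print("vector:  ", count_page_misses(vector_schedule(n)))
    print("unrolled:", count_page_misses(unrolled_schedule(n)))
```

In this model the vector schedule takes 3 misses for the whole loop,
while the unroll-by-4 schedule takes 3 per trip over 32 trips, 96 in
all. A real memory system will differ in the details, but the ratio is
the point.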

When you go to more complex expressions I think vector registers win
bigger and bigger, as long as you're looking at vector-type
expressions. For scalar expressions, I think the two methods would
provide about the same speed.

So, to repeat: did Intel simulate this kind of alternative design, and
would it provide better performance with a simpler compiler?