Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!zaphod.mps.ohio-state.edu!swrinde!cs.utexas.edu!rice!ariel.rice.edu!preston
From: preston@ariel.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: what if the i860 had vector registers... ?
Message-ID: <1991May29.112250.22178@rice.edu>
Date: 29 May 91 11:22:50 GMT
References: <1991May24.035756.5161@murdoch.acc.Virginia.EDU> <13990@exodus.Eng.Sun.COM> <1991May28.190220.22150@murdoch.acc.Virginia.EDU>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
Lines: 66

gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:

>This is what kills the iPSC i860 box -- my code is guaranteed to run
>at the rate of the memory system, with no useful caching effects.

This problem kills most everything except vector machines (a la Cray).
That's why BLAS-2 and BLAS-3 routines were invented (i.e., to provide
higher-level operations with enough data reuse that the memory
system can keep up with the FP unit).

Try saxpy's on an HP or IBM when the data is not in cache.
They aren't going to get close to peak performance either.
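
To put rough numbers on the data-reuse point, here is a small sketch in C
that counts flops per memory reference for saxpy (BLAS-1) and for an
n x n matrix multiply (BLAS-3). The function names are mine, and the
matrix-multiply count assumes an idealized cache where each matrix
crosses the memory interface only once.

```c
/* saxpy, y = a*x + y over n elements:
   2n flops (one multiply, one add per element),
   3n memory references (load x, load y, store y). */
double saxpy_flops_per_ref(double n)
{
    return (2.0 * n) / (3.0 * n);
}

/* n x n matrix multiply, C = C + A*B:
   2n^3 flops.  ASSUMPTION: perfect reuse, so A, B and C are each
   read once and C is written once, giving 4n^2 memory references. */
double gemm_flops_per_ref(double n)
{
    return (2.0 * n * n * n) / (4.0 * n * n);
}
```

For n = 128 this gives 2/3 of a flop per reference for saxpy and 64 for
the matrix multiply. The saxpy ratio stays fixed no matter how large n
gets; the matrix-multiply ratio grows with n.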

--------------------

>For example:
>
>do i = 1, 128
>   z(i) = a * x(i) + y(i)
>enddo

>With the current scheme you unroll by 4, load 4 x's, load 4 y's taking
>a page-miss penalty, while calculating the results, and then write the
>results back taking another page miss. That's 3 memory system page
>misses per cycle.
            ^^^^^  iteration

Generally, you won't be able to use the load/store quad instructions because
it's difficult to be assured of alignment.

It's better to handle the above case by copying x into an aligned
chunk of memory (called VR1, say) using pipelined loads (4 in a row)
and store-quads.  Then copy y into VR2.  Finally, calculate the sum
into VR3 (using quad-loads and stores).  Then copy VR3 back to z
using quad-loads and singleton stores.

VR3 might be identical with either VR1 or VR2.
Generally, the VRs will end up in cache.

Naturally, we could use subroutines for the copying between
memory and cache/VRs and for the vector add.
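
The shape of that scheme can be sketched in C as below. This is only an
outline under my own assumptions: VLEN, vr_load, vr_store, vr_saxpy and
saxpy_strips are made-up names, the 16-byte alignment is what I assume a
quad access wants, and portable C can't express pipelined loads or quad
stores, so plain copies and a plain loop stand in for them.

```c
#include <stddef.h>
#include <string.h>

#define VLEN 128   /* hypothetical strip length, assumed small enough to stay in cache */

/* Aligned staging buffers standing in for VR1 and VR2.
   C11 _Alignas requests 16-byte alignment (assumed quad-access requirement). */
static _Alignas(16) float vr1[VLEN];
static _Alignas(16) float vr2[VLEN];

/* Copy n elements from arbitrarily aligned memory into a VR.
   On the i860 this is where pipelined loads and store-quads would go. */
static void vr_load(float *vr, const float *src, size_t n)
{
    memcpy(vr, src, n * sizeof *src);
}

/* Copy n elements from a VR back to arbitrarily aligned memory
   (quad-loads and singleton stores in the real thing). */
static void vr_store(float *dst, const float *vr, size_t n)
{
    memcpy(dst, vr, n * sizeof *vr);
}

/* out = a*p + q, all operands in aligned VRs; out may be the same VR as p or q. */
static void vr_saxpy(float *out, float a, const float *p, const float *q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a * p[i] + q[i];
}

/* z = a*x + y, processed one strip of at most VLEN elements at a time. */
void saxpy_strips(size_t n, float a, const float *x, const float *y, float *z)
{
    for (size_t i = 0; i < n; i += VLEN) {
        size_t len = (n - i < VLEN) ? (n - i) : VLEN;
        vr_load(vr1, x + i, len);          /* x strip -> VR1 */
        vr_load(vr2, y + i, len);          /* y strip -> VR2 */
        vr_saxpy(vr2, a, vr1, vr2, len);   /* VR3 is the same buffer as VR2 */
        vr_store(z + i, vr2, len);         /* VR3 -> z strip */
    }
}
```

The point of the split is that only vr_load and vr_store ever touch
unaligned user arrays; the arithmetic loop sees nothing but aligned,
cache-resident buffers.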

------------

Of course, Intel thought of all this.
They do it, using a traditional vectorizing front-end.

The problem is that there's still insufficient memory bandwidth
to support vector operations at 80 MFlops.  That kind of performance
requires hefty $$s.
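
A back-of-the-envelope version of that claim, as a C sketch. The
function name is mine, and it assumes the saxpy-style loop from the
example (two loads and one store per two flops) with every operand
coming from main memory.

```c
/* Memory bandwidth, in MBytes/sec, needed to sustain `mflops` on a
   saxpy-style loop.  ASSUMPTIONS: 3 memory references per 2 flops,
   and no operand is ever found in cache or registers. */
double saxpy_bandwidth_mbytes(double mflops, double bytes_per_word)
{
    double refs_per_flop = 3.0 / 2.0;
    return mflops * refs_per_flop * bytes_per_word;
}
```

With 4-byte single-precision words, saxpy_bandwidth_mbytes(80.0, 4.0)
works out to 480 MBytes/sec; with 8-byte words the figure doubles to 960.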

------------

Note also that the "page misses" mentioned above and in Moyer's TR
are on a particular board, built around 8 MBytes of dynamic RAM.
Other boards have been built with static RAM ($$'s) that don't have
the problem of crossing page boundaries.

-------------

I think the difficulties with the i860 lie more in generating code
for the exposed pipelines and handling their unhelpful multiply-add
instruction.  The bandwidth questions are going to afflict all
the micros for a while.

Preston Briggs