Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!sunybcs!boulder!grunwald
From: grunwald@foobar.colorado.edu (Dirk Grunwald)
Newsgroups: comp.arch
Subject: Re: MYRIAS - yet again
Message-ID: <14898@boulder.Colorado.EDU>
Date: 14 Dec 89 21:22:49 GMT
References: <13683@reed.UUCP> <515@ctycal.UUCP> <4218@amelia.nas.nasa.gov>
Sender: news@boulder.Colorado.EDU
Reply-To: grunwald@foobar.colorado.edu
Organization: University of Colorado at Boulder
Lines: 68
In-reply-to: serafini@amelia.nas.nasa.gov's message of 14 Dec 89 06:04:24 GMT

DBS> the hardware since they're trying to build a programming paradigm that will be
DBS> both easy to use and easy to port.  They claim that converting old code takes
DBS> hours or days instead of months.  Basically anything that can be vectorized
DBS> on a Cray can be parallelized on the Myrias.  They downplay the issues of


While it may be possible, I don't think it's practical. According to
the talk Myrias gave here (we have one somewhere, see earlier note),
there is no synchronization possible.

Thus, you can't cheaply parallelize:

   Do I = 2, N
    A(I) = B(I) * C(I)
    D(I) = A(I-1) * C(I)
   end

On the Cray, this would be vectorized:
	A(2:N) = B(2:N) * C(2:N)
	D(2:N) = A(1:N-1) * C(2:N)

On a machine with synchronization, you could say:

   Doall I = 2, N
    A(I) = B(I) * C(I)
    POST(A,I)
    WAIT(A,I-1)
    D(I) = A(I-1) * C(I)
   end

or
   Doall I = 2,N
    A(I) = B(I) * C(I)
   end
   Doall I = 2,N
    D(I) = A(I-1) * C(I)
   end
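
To make the two Doall variants above concrete, here is a rough Python
sketch, not anything Myrias or Cray specific. It assumes one thread per
iteration and models POST/WAIT with one threading.Event per element of A;
the function names are made up for illustration. It also assumes A(1) is
input data and therefore counts as already posted, since no iteration
ever posts it. Arrays are 1-based with index 0 unused.

```python
import threading

def sequential(b, c, a1):
    # Reference: the original Do loop, run in order.
    n = len(b) - 1
    a = [0.0] * (n + 1)
    d = [0.0] * (n + 1)
    a[1] = a1
    for i in range(2, n + 1):
        a[i] = b[i] * c[i]
        d[i] = a[i - 1] * c[i]
    return a, d

def post_wait(b, c, a1):
    # Single Doall with POST/WAIT: one thread per iteration (assumption).
    n = len(b) - 1
    a = [0.0] * (n + 1)
    d = [0.0] * (n + 1)
    a[1] = a1
    ready = [threading.Event() for _ in range(n + 1)]
    ready[1].set()  # assumption: A(1) is input, so treat it as posted

    def body(i):
        a[i] = b[i] * c[i]
        ready[i].set()       # POST(A, I)
        ready[i - 1].wait()  # WAIT(A, I-1)
        d[i] = a[i - 1] * c[i]

    threads = [threading.Thread(target=body, args=(i,))
               for i in range(2, n + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a, d

def two_phase(b, c, a1):
    # Two Doalls: the end of the first loop is the only synchronization.
    n = len(b) - 1
    a = [0.0] * (n + 1)
    d = [0.0] * (n + 1)
    a[1] = a1
    for i in range(2, n + 1):   # first Doall: iterations are independent
        a[i] = b[i] * c[i]
    for i in range(2, n + 1):   # second Doall: reads only finished A values
        d[i] = a[i - 1] * c[i]
    return a, d
```

Every iteration posts before it waits, so the POST/WAIT version cannot
deadlock. The two-loop version needs no per-element events at all, only
the barrier between the loops.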

The Myrias forces the second form (two separate Doall loops), because
it has no synchronization. You could optimize this a little...

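   S = (N-1)/Processors        ! strip length; assumes Processors divides N-1
   Doall IP = 1, Processors
    LO = 2 + (IP-1)*S          ! first iteration of this strip
    Do I = LO, LO + S - 1
     A(I) = B(I) * C(I)
     if (I != LO)
      D(I) = A(I-1) * C(I)
    end
   end
   Doall IP = 1, Processors
    LO = 2 + (IP-1)*S
    D(LO) = A(LO-1) * C(LO)    ! boundary element, needs the merged A
   end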

(more or less -- you just strip mine the loop based on the number of
  processors, execute all first statements, and only the second statements
  that are local to your strip, merge pages and then assign all
  cross-process iterations)
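
The strip-mining scheme described above can be checked with a small
Python sketch. This is an illustration under assumptions, not the Myrias
programming model: the strips are run serially here, the page merge is
only a comment, and strip_mined and its parameters are hypothetical
names. Arrays are 1-based with index 0 unused, and the strip length is
rounded up so that N-1 need not be a multiple of the processor count.

```python
def strip_mined(b, c, a1, processors):
    n = len(b) - 1
    a = [0.0] * (n + 1)
    d = [0.0] * (n + 1)
    a[1] = a1
    total = n - 1                              # iterations 2..n
    strip = max(1, -(-total // processors))    # ceiling division
    starts = list(range(2, n + 1, strip))      # first iteration of each strip

    # Phase 1: each strip would be one parallel task. A strip reads and
    # writes only its own elements, so the strips are independent.
    for lo in starts:
        hi = min(lo + strip - 1, n)
        for i in range(lo, hi + 1):
            a[i] = b[i] * c[i]
            if i != lo:
                d[i] = a[i - 1] * c[i]

    # (page merge would happen here)

    # Phase 2: one boundary element per strip, using A from the
    # neighbouring strip (or the input A(1) for the first strip).
    for lo in starts:
        d[lo] = a[lo - 1] * c[lo]
    return a, d
```

With processors = 1 this degenerates to the original loop plus one
fix-up; with processors >= N-1 every strip has a single iteration and
the scheme becomes exactly the two separate Doall loops.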

But you'll need to force a page merge between the two doall loops (I
think they call them 'pardo' or something).

It's not clear to me that this is going to be faster than e.g. a
CM-2 or a Cray.

For loops involving no cross-iteration dependence, however, it should
work well. I believe this is what they had intended, by the way, because
the designers (a physicist?) had several problems with no
cross-iteration dependence.