Path: utzoo!attcan!uunet!lll-winken!lll-lcc!ames!ncar!tank!uxc!uxc.cso.uiuc.edu!uxg.cso.uiuc.edu!uicsrd.csrd.uiuc.edu!jaxon
From: jaxon@uicsrd.csrd.uiuc.edu
Newsgroups: comp.lang.apl
Subject: Re: APL for Parallel Algorithms
Message-ID: <49700006@uicsrd.csrd.uiuc.edu>
Date: 12 Dec 88 19:00:00 GMT
References: <3779@hubcap.UUCP>
Lines: 73
Nf-ID: #R:hubcap.UUCP:3779:uicsrd.csrd.uiuc.edu:49700006:000:3476
Nf-From: uicsrd.csrd.uiuc.edu!jaxon    Dec 12 13:00:00 1988


>I want to compute the inner product A+.xB given the simple numeric matrices
>A and B, this seems easy to implement in a way which exploits vector/
>parallel facilities.

done

Many APLs already do this.  The inner loops of the interpreter typically
are tuned to the machine's pipeline, vector ops, or concurrency features.
STSC's UNIX APL for Alliant FX/8,  Analogic's APL Machine, are two popular
examples.  

> One of the largest problems with parallel APL is that to really
>get a speed up, you need a compiler and we all know how much fun compiling 
>APL is...

That is to say that naive interpretation using parallel features will only
improve the "per item" times, not the  "per function" or "per line" costs.
Eventually these costs dominate the computation time, especially in APL1
coding styles.  On the APL Machine, the attached low-precision array
processor reduces per-item times so much that "counting the symbols" in
an expression gives a good approximation of the number of milliseconds
it will compute, regardless of data size.

There is hope for compilation however.  Several aspects of APL make it
superior to Fortran 8x:

   - Array temporaries are inaccessible
   - Loop control information is implicit
   - Storage cannot be aliased or equivalenced
   - Data dependence distances (actual time between side-effects)
     is long relative to average Fortran
   - Arrays are used in their entirety more often than in Fortran
     DIMENSION R(100,100) ... but they only use the upper left NxM.
   - The parallelism in APL does not involve an explicit notion of
     execution thread.  In 8x extensions (such as PCF Fortran) the
     program text commits to a certain distribution of work.


>Another example would be many uses of the "for each"
>operator; doesn't this essentially specify independent execution chains?

  A standard for APL2 could be written to capture that
  idea, but as it stands right now all the APL2-like dialects agree
  that items of the arguments are processed in ravel order.  Only a
  smart compiler could detect that the functional operand of Each has
  no non-trivial side effect (fully parallelizable) or has infrequent
  enough side effects that a little synchronization is all that's needed.

  In my APLB implementation, the operators (Each, Reduce, Scan, Product)
  are by far the fastest way to loop through user code.  They can call
  a user fn repeatedly, yet only set up its stack and enter it once.
  I like to cite the fact that a multi-column numeric grade function
  written using the Reduce operator to accumulate the compound permutation
  vector from  a vector of the enclosed columns of the data, ran 
  substantially faster than the best APL1 code for this problem, and
  only slightly slower than a primitive implementation.  In fact we 
  decided to print it as an example under Grade and Reduce instead of
  adding more code to the numeric grade primitive.

  These operators should be optimal even on serial machines.  Users 
  will go wherever the code runs the fastest, and in APL they will
  use constructs which will usually be parallelizable.

  A more serious impediment to parallel/vector APLs is the arithmetic
  conventions of current hardware.  Tolerant equality, typeless numerics,
  and integer vector math  are not well-supported by current vendors.
  
We have this dream... and someday...

 .greg jaxon.    jaxon@uicsrd.uxc.uiuc.edu
 UI Center for Supercomputing Research and Development