Path: utzoo!attcan!uunet!lll-winken!lll-lcc!ames!ncar!tank!uxc!uxc.cso.uiuc.edu!uxg.cso.uiuc.edu!uicsrd.csrd.uiuc.edu!jaxon From: jaxon@uicsrd.csrd.uiuc.edu Newsgroups: comp.lang.apl Subject: Re: APL for Parallel Algorithms Message-ID: <49700006@uicsrd.csrd.uiuc.edu> Date: 12 Dec 88 19:00:00 GMT References: <3779@hubcap.UUCP> Lines: 73 Nf-ID: #R:hubcap.UUCP:3779:uicsrd.csrd.uiuc.edu:49700006:000:3476 Nf-From: uicsrd.csrd.uiuc.edu!jaxon Dec 12 13:00:00 1988 >I want to compute the inner product A+.xB given the simple numeric matrices >A and B, this seems easy to implement in a way which exploits vector/ >parallel facilities. done Many APLs already do this. The inner loops of the interpreter typically are tuned to the machine's pipeline, vector ops, or concurrency features. STSC's UNIX APL for Alliant FX/8, Analogic's APL Machine, are two popular examples. > One of the largest problems with parallel APL is that to really >get a speed up, you need a compiler and we all know how much fun compiling >APL is... That is to say that naive interpretation using parallel features will only improve the "per item" times, not the "per function" or "per line" costs. Eventually these costs dominate the computation time, especially in APL1 coding styles. On the APL Machine, the attached low-precision array processor reduces per-item times so much that "counting the symbols" in an expression gives a good approximation of the number of milliseconds it will compute, regardless of data size. There is hope for compilation however. Several aspects of APL make it superior to Fortran 8x: - Array temporaries are inaccessible - Loop control information is implicit - Storage cannot be aliased or equivalenced - Data dependence distances (actual time between side-effects) is long relative to average Fortran - Arrays are used in their entirety more often than in Fortran DIMENSION R(100,100) ... but they only use the upper left NxM. - The parallelism in APL does not involve an explicit notion of execution thread. In 8x extensions (such as PCF Fortran) the program text commits to a certain distribution of work. >Another example would be many uses of the "for each" >operator; doesn't this essentially specify independent execution chains? A standard for APL2 could be written to capture that idea, but as it stands right now all the APL2-like dialects agree that items of the arguments are processed in ravel order. Only a smart compiler could detect that the functional operand of Each has no non-trivial side effect (fully parallelizable) or has infrequent enough side effects that a little synchronization is all that's needed. In my APLB implementation, the operators (Each, Reduce, Scan, Product) are by far the fastest way to loop through user code. They can call a user fn repeatedly, yet only set up its stack and enter it once. I like to cite the fact that a multi-column numeric grade function written using the Reduce operator to accumulate the compound permutation vector from a vector of the enclosed columns of the data, ran substantially faster than the best APL1 code for this problem, and only slightly slower than a primitive implementation. In fact we decided to print it as an example under Grade and Reduce instead of adding more code to the numeric grade primitive. These operators should be optimal even on serial machines. Users will go wherever the code runs the fastest, and in APL they will use constructs which will usually be parallelizable. A more serious impediment to parallel/vector APLs is the arithmetic conventions of current hardware. Tolerant equality, typeless numerics, and integer vector math are not well-supported by current vendors. We have this dream... and someday... .greg jaxon. jaxon@uicsrd.uxc.uiuc.edu UI Center for Supercomputing Research and Development