Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!mintaka!think.com!spool.mu.edu!uunet!europa.asd.contel.com!gatech!mcnc!borg!cs.unc.edu!prins From: prins@cs.unc.edu (Jan Prins) Newsgroups: comp.lang.apl Subject: Re: ACORN, and other niceties Message-ID: <4481@borg.cs.unc.edu> Date: 19 Jun 91 19:39:50 GMT References: <1991Jun11.200234.6130@aplcen.apl.jhu.edu> <1991Jun13.044828.29504@yrloc.ipsa.reuter.COM> Sender: news@cs.unc.edu Organization: UNC-Chapel Hill Computer Science Lines: 67 In article <1991Jun13.044828.29504@yrloc.ipsa.reuter.COM>, Roger Hui writes: > > (about a remark of Ken Kennedy:) > > > 0. "APL is not a good language for doing parallel algorithms." His reasoning? > > People take huge outer products and then reduce or slice along a subarray, > > and thus the algorithms are inefficient (I tried to explain about > > properly-formed arguments for inner product, and about the RANK operator). > > These so-called "huge outer products" can take CONSTANT time to > compute on parallel machines. For example, consider the problem of > computing x e. y (x epsilon y) where x and y are vectors > of length m and n. One algorithm is > > +./ y =/ x (or/[0] y jot . = x) > > On a serial machine, this takes time O m*n (and IS inefficient). > On a SIMD machine (like the Connection Machine), a straightforward > translation of the algorithm yields: > > Propogate x and y to m*n processors O log m>.n > Compute the m*n = operations of the outer product O 1 [!] > Compute the or-reduction O log n > > The total time taken is O log m>.n . Log time is "speed of light"; > this "huge outer product" algorithm is in fact time optimal. Unfortunately I don't get to this newsgroup much, and although I'm a big fan of APL and related languages for data-parallel programming, I must raise some flags regarding the "efficiency" claim above. You speak easily of computing m*n comparison operations in unit time, but of course this presupposes that you have m*n physical processors. The correct time for this operation is O((m*n)/P) where P is the number of actual processors. Note that P is a constant! If you are wealthy you can buy P=64K today, but even this substantial machine runs out of processors for m=n=256 with this algorithm. For longer vectors the point is that this algorithm IS inefficient: its asymptotic time complexity is O(m*n) since P is constant. For sufficiently large inputs this algorithm will take longer than the efficient sequential algorithm on a sequential machine. In this case I would predict that at n=m=20,000 APL on my workstation would outrun a full-tilt CM-2 using the outer product algorithm. And it is precisely the larger instances of a problem (i.e. longer vectors) which drive people to consider execution on parallel platforms. I like the big outer products programming style: it is exquisitely expressive. But executed directly on large inputs, it is not efficient. I don't agree with Kennedy that this is a condemnation of APL's suitability for scientific programming. It suggests that some refinement is needed to move an APL program into production use on very large problem sizes. Perhaps a compiler can do this (this is awfully difficult and may be the basis of Kennedy's objection) or more likely the programmer must sort this out. This latter technique seems to be the one that Kennedy espouses in the Fortran-90 case, and I don't see why it couldn't apply to the APL case as well, with the added advantage of much more flexibility in the initial expression of the algorithm. > Roger Hui (hui@yrloc.ipsa.reuter.COM) --\-- Jan Prins (prins@cs.unc.edu) / Computer Science Dept. --\-- UNC Chapel Hill