Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!mintaka!think.com!spool.mu.edu!uunet!europa.asd.contel.com!gatech!mcnc!borg!cs.unc.edu!prins
From: prins@cs.unc.edu (Jan Prins)
Newsgroups: comp.lang.apl
Subject: Re: ACORN, and other niceties
Message-ID: <4481@borg.cs.unc.edu>
Date: 19 Jun 91 19:39:50 GMT
References: <1991Jun11.200234.6130@aplcen.apl.jhu.edu> <1991Jun13.044828.29504@yrloc.ipsa.reuter.COM>
Sender: news@cs.unc.edu
Organization: UNC-Chapel Hill Computer Science
Lines: 67

In article <1991Jun13.044828.29504@yrloc.ipsa.reuter.COM>, Roger Hui writes:
>
>  (about a remark of Ken Kennedy:)
>
> > 0.  "APL is not a good language for doing parallel algorithms."  His reasoning?
> > People take huge outer products and then reduce or slice along a subarray,
> > and thus the algorithms are inefficient (I tried to explain about
> > properly-formed arguments for inner product, and about the RANK operator).
>  
> These so-called "huge outer products" can take CONSTANT time to
> compute on parallel machines.  For example, consider the problem of
> computing  x e. y  (x epsilon y)  where x and y are vectors
> of length m and n.  One algorithm is
>  
>  +./ y =/ x   (or/[0] y jot . = x)
>  
> On a serial machine, this takes time O m*n (and IS inefficient).
> On a SIMD machine (like the Connection Machine), a straightforward
> translation of the algorithm yields:
>  
>  Propogate x and y to m*n processors                 O log m>.n
>  Compute the m*n = operations of the outer product   O 1        [!]
>  Compute the or-reduction                            O log n
>  
> The total time taken is O log m>.n .  Log time is "speed of light";
> this "huge outer product" algorithm is in fact time optimal.

Unfortunately I don't get to this newsgroup much, and although I'm
a big fan of APL and related languages for data-parallel programming, 
I must raise some flags regarding the "efficiency" claim above.

You speak easily of computing m*n comparison operations in unit time,
but of course this presupposes that you have m*n physical processors. 
The correct time for this operation is O((m*n)/P) where P is the number 
of actual processors.  Note that P is a constant!  If you are wealthy
you can buy P=64K today, but even this substantial machine runs out
of processors for m=n=256 with this algorithm. 

For longer vectors the point is that this algorithm IS inefficient:
its asymptotic time complexity is O(m*n) since P is constant.  For 
sufficiently large inputs this algorithm will take longer than the 
efficient sequential algorithm on a sequential machine.  In this case
I would predict that at n=m=20,000 APL on my workstation would outrun a 
full-tilt CM-2 using the outer product algorithm. 

And it is precisely the larger instances of a problem (i.e. longer
vectors) which drive people to consider execution on parallel platforms.

I like the big outer products programming style: it is exquisitely 
expressive.  But executed directly on large inputs, it is not efficient.  
I don't agree with Kennedy that this is a condemnation of APL's
suitability for scientific programming.  It suggests that some refinement 
is needed to move an APL program into production use on very large 
problem sizes.  Perhaps a compiler can do this (this is awfully difficult 
and may be the basis of Kennedy's objection) or more likely the programmer 
must sort this out.  This latter technique seems to be the one that
Kennedy espouses in the Fortran-90 case, and I don't see why it couldn't
apply to the APL case as well, with the added advantage of much more
flexibility in the initial expression of the algorithm.


> Roger Hui (hui@yrloc.ipsa.reuter.COM)


--\--  Jan Prins  (prins@cs.unc.edu) 
  /    Computer Science Dept.        
--\--  UNC Chapel Hill