Path: utzoo!attcan!uunet!lll-winken!sol.ctr.columbia.edu!cica!iuvax!silver!jashley
From: jashley@silver.ucs.indiana.edu (J. Michael Ashley)
Newsgroups: comp.lang.functional
Subject: re: Id array comprehensions revisited
Message-ID: <51173@iuvax.cs.indiana.edu>
Date: 17 Jul 90 16:28:29 GMT
Sender: news@iuvax.cs.indiana.edu
Reply-To: jashley@silver.ucs.indiana.edu (J. Michael Ashley)
Organization: Graceland
Lines: 62

Wolfgand Grieskamp writes:
>In article <51123@iuvax.cs.indiana.edu> jashley@silver.ucs.indiana.edu (J. Michael Ashley) writes:
>>A little while ago I asked why array comprehensions are not parallel.  In
>>particular, why does an array comprehension like this:
>>
>>	{ array (1, n) | [i] = f i || i <- 1 to n }
>>
>not evaluate the (f i)'s concurrently?  Well, it turns out that the i's get
>generated sequentially as a list (a list in the data structure sense, mind
>you).
>
>This index list is intermediate and an optimizing compiler can eliminate it 
>using loop fusion.

[ omitting the pleasantly detailed derivation of optimization ]

>Using the above rule you get finaly:
>
>  DEF init1(A,f,i,n) == IF i>n  THEN A
>			IF i<=n THEN init1(update(A,i,f(i)),i+1,n) FI
>
>(A compiler will probably not use this fold/unfold method,
>because it depends to much on "eureka" steps. Furthermore, the view 
>is based on the assumption that update is O(1). Such an update
>can be established by a sharing analysis for the given example).
>
>Regarding to concurrent initialization assume simplisticaly that init1
>is not strict in its first argument.

Actually, this is precisely the optimization that the current Id compiler
does right now; if it sees the construct "1 to n", then the compiler will
transform it into a loop.

Since we are using I-structures, the update is O(1) (without sharing analysis),
and init1 is not strict in it's first argument.

This doesn't solve the problem, however.  Since we are using a dataflow
architecture, the function f will undergo partial evaluation to the point
where it must have it's argument.  What this means is that if we have lots
and lots of processors, 90% of the work of creating the array is done by
the first iteration or two of init1.  The execution is now bottlnecked by
this sequential generation of the indices.

A slightly better compiler will generate the indices in O(log n) time
without any trouble.

But this doesn't solve the problem completely.  In the case of 
accumulator comprehensions, the programmer will probably not be using this
easy-to-optimize "1 to n" construct.  Instead, there will be some arbitrary
function (probably complicated) that *must return a list*.  Even the most
trivial numerical codes will be more than a match for any compiler 
optimization.  So we are again stuck with lots of parallelism when 
generating the elements of the list but bounded by an O(n) append to put
this results back together to form the final list.

Thomas Breuel suggested a couple of ways ("difference lists and/or pointer
jumping") to overcome the O(n) append problem.  This is kind of a hack, and
Thomas admits that.  I still sit on the position that lists are not the
right data structure for the indices.

mike ashley
jashley@lanl.gov