Path: utzoo!mnetor!uunet!husc6!cmcl2!brl-adm!umd5!ames!necntc!linus!alliant!muller
From: muller@alliant.Alliant.COM (Jim Muller)
Newsgroups: comp.arch
Subject: Re: Single tasking the wave of the future?
Message-ID: <1030@alliant.Alliant.COM>
Date: 22 Dec 87 22:59:50 GMT
References: <18@amelia.nas.nasa.gov> <2341@encore.UUCP> <25@amelia.nas.nasa.gov>
Reply-To: muller@alliant.UUCP (Jim Muller)
Organization: Alliant Computer Systems, Littleton, MA
Lines: 140
Keywords: parallel processing today

In <25@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes,
with some references to article <2341@encore.UUCP> fay@encore.UUCP (Peter Fay):

> >...And then there is more sophisticated mechanisms ("micro-tasking", gang
> >scheduling, vectorizing fortran compilers) available to any user who...

>A problem with these more sophisticated mechanisms is that they can
>lead to parallel execution in which the wall clock time goes up as a
>function of the number of processors, rather than down... 
   With vectorization this is typically due to excessive masking.  In heavily
   IF'ed code, the processor(s) can end up doing much more work than necessary.
   The vector overhead generally is insignificant by comparison.  True parallel
   processing, however, does not suffer this problem, since any one processor
   never has to execute the "not-to-be-executed" portions.  For this reason,
   true parallel processing can reduce runtimes even on code for which vector
   processing is a liability.

   [
   Example of code for which vectorization may be a liability:
       DO I = 1, 100
          IF (NS.GT.M) THEN
             .
             .
          ELSE IF (N..GT.MX) THEN
             IF (II.EQ.IX) THEN
                .
                .
             ENDIF
          ENDIF
   100  CONTINUE
   ]

>Another problem is that no compiler can perfectly optimize.  The better the
>programmer understands what the optimizer can optimize, the easier it
>is to write optimizable code.  With vector code, this is fairly
>straightforward to do, with parallel code, it is much more difficult.
>It has been my experience in five years of writing code to exploit
>distributed and concurrent parallelism that it is hard in a large
>number of cases.
   It is a truism for any machine, vector- or multi- or not, that a good
   program will run better than a poor one, and that a programmer who is
   ignorant of the compiler's capabilities can accidentally defeat its best
   efforts.  The same holds for hardware issues as well as software.  As for
   parallelism being harder to exploit than vectorization, the reverse is more
   likely to hold.  Parallel execution can be applied to practically any
   vectorizable structure, while the reverse is decidedly not true.  Many
   structures can be run as parallel processes, with runtime speedup, even if
   vectorization is either impossible or a liability.  Anyway, it should be
   the compiler's job to analyze code for parallel or vector opportunities,
   and compilers capable of doing that well do exist.

>There are well behaved algorithms, such as those involved in PDE solution
>for which parallelism is trivial. There are also algorithms requiring high
>communication cost, much synchronization and unpredictable work per task
>which are nearly untractable.  There appear to be an entire class of
>algorithms for which no parallel solution is more efficent than a sequential
>solution on the same processor.
   Indeed parallelism is trivial in some problems.  That does not make it
   useless.  Instead, it make it easy and efficient.  More to the point,
   however, for those problems which do not exhibit inherent parallelism,
   nothing requires the machine (or compiler) to use it where inappropriate.
   All that is needed is that the unused processors be doing something else.
   For problems where it *is* appropriate, it can produce significant runtime
   speedup.  In fact, most problems exhibit "some" parallelism, usually enough
   to be beneficial.  For large, computationally intensive problems, the gain
   in execution speed from parallelism will usually justify the amount of
   effort required to exploit it, especially if the compiler rather than the
   programmer does most of the work.
   
> >Debugging is a whole other soapbox, but my experience is that debugging
> >coding errors is not much more difficult than uniprocessors.  What is hard
> >(or "impossible" with current tools) is detecting race conditions and
> >bottlenecks - i.e. CONCEPTUAL errors. 
   Race conditions are typically a problem only when a compiler has been
   inappropriately "permitted" to optimize code that it normally would have
   refused.  The obvious approach, of course, should be to debug in single-
   processor mode.

>...Throwing a parallel machine at an average workload is like throwing a
>vector machine at it, only more so:  The difference between vendor "not to
>be exceeded" speed for the machine and delivered throughput for the
>workload can be two order of magnitude, dramatically reducing the real
>price / performance.
   This was discussed by this group just a month or so ago.  There are really
   two causes.  The first is the structure of the code itself.  Certainly,
   real life programs are often dominated by scalar code, and better-written
   code will be less likely to thwart the compiler.  More importantly, "peak"
   speeds quoted are usually the output rate during a vector instruction,
   totally ignoring the vector startup time.  Startup time can be significant
   and must be paid every time a vector instruction starts (or repeats if the
   loop length is greater than the vector length).  When this is taken into
   account, the *real* maximum rates can be achieved (well, almost) by codes
   that are well-structured for the machine.  For an "average workload",
   a multi-processor machine can be beneficial if the various processors are
   free to do other jobs when parallelism is inappropriate.

>Top end clock rates are down to a few nanoseconds, and it doesn't appear
>likely that subnanosecond rates are going to arrive before...
   Light (or electromagnetic pulses) can only travel so fast, so you have to
   make things smaller to make them faster (obviously).  This is *exactly* why
   radically different approaches such as parallelism, etc. are *appropriate*
   courses to follow, rather than useless.  It is the single processor that is
   becoming harder to speed up.

>...it is a most difficult problem, and one which decades of parallel
>processing research has not solved adequately...
>When you couple this with the fact that after twenty years of various
>kinds of software/hardware research, parallel processing is still
>limited to the technology described by Dykstra in 63, and still has
>the same problems in software development, it is probably fair to say
>that little progress is being made in software either.

   My background is science, not computer architecture, so I do not fully
   appreciate the "decades of parallel processing research" or the "past
   twenty years of various kinds of software/hardware".  But I am quite
   familiar with some approaches, and have seen that it does work. The
   Alliant FX/8 was based on Kuck (1976), and in fact, was designed to
   address many of the objections you raised about parallel processing's
   limitations.  It certainly cannot be described as equivalent to anything
   from 20 years ago.  It does have most of the features I have described
   above, including an intelligent compiler (you are right, the programmer
   should not have to explicitely deal with parallelism), parallel I/O
   capability, slowdown with vectorization for heavily-masked code (most
   vector machines behave this way), "other jobs" for free (non-busy)
   processors when parallelism is inappropriate, race-condition failure
   when the compiler is permitted to optimize inappropriately, good
   parallelism for some codes and none at all for others, significant
   vector startup time that reduces the "peak" speed to a more "real" speed,
   and, finally, a dependence, as with any other machine, on the programmer
   avoiding any of the pitfalls that will defeat the machine's best intentions.

REFERENCES
Kuck, D., "Parallel Processing of Ordinary Programs," Advances in Computing,
   Vol. 15, 119-179, M. Rubinoff and M.L. Yovits, Eds., Academic Press, 1976.
-----------------------------------------------------------------------------
This is not an official Alliant response.  I am simply an employee who feels
differently than you do.  I only mentioned the FX/8 because it is a good,
real-world, functioning example that attempts to address the valid objections
you raised, and because I understand it.  Alliant had no input into this
posting, nor did Alliant require me to write this disclaimer.