Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!sri-spam!ames!amdcad!amd!intelca!mipos3!omepd!intelisc!littlei!ogcvax!pase
From: pase@ogcvax.UUCP (Douglas M. Pase)
Newsgroups: comp.arch
Subject: Re: chewing up mips with graphics
Message-ID: <1323@ogcvax.UUCP>
Date: Thu, 25-Jun-87 17:17:02 EDT
Article-I.D.: ogcvax.1323
Posted: Thu Jun 25 17:17:02 1987
Date-Received: Sat, 27-Jun-87 09:48:08 EDT
References: <8270@amdahl.amdahl.com> <359@rocky2.UUCP> <6240@steinmetz.steinmetz.UUCP> <6328@beta.UUCP> <2120@dg_rtp.UUCP> <astroatc.337>
Reply-To: pase@ogcvax.UUCP (Douglas M. Pase)
Organization: Oregon Graduate Center, Beaverton, OR
Lines: 87


	In article <2120@dg_rtp.UUCP> wood@dg_rtp.UUCP (Tom Wood) writes:

	Personally, I believe the 90% solution to obtaining parallelism is to
	take advantage of multiple independent computations.  (It's much easier
	to make 100 compiles go 100 times faster by using 100 machines than it
	is to make 1 machine go 100 times faster on each compile.)

I hope I'm not missing the point, but I think you're off by 10% -- that is,
the only approach to parallelism is, by definition, taking advantage of
multiple independent computations.  There are many levels to choose from, not
just one.  What you have mentioned here (with the 100 compiles) is parallelism
at the process level.  This approach is the easiest and the best understood,
and many vendors sell commercial products which successfully take advantage of
this type of parallelism.  Honeywell has been doing this for years with CP-6,
and Sequent and Apollo are two newer entries.  (Oh, I see.  I bet this level
is what you meant by "independent" -- correct me if I still misunderstand.)
The advantage of this level is that the overhead required to parallelize the
computation is relatively small and is controlled by the system (e.g., in
system locks and resource scheduling) - not introduced into the computation
itself.
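
To put numbers on the 100-compiles example, here is a minimal sketch in Python.
It assumes the jobs are fully independent and ignores system overhead entirely.
The function name `makespan` and the 60-second compile time are hypothetical,
chosen only for illustration.

```python
import heapq

def makespan(job_times, machines):
    """Wall-clock time to finish independent jobs on identical machines.

    Greedy assignment: each job goes to whichever machine frees up first.
    No communication or scheduling overhead is modeled.
    """
    if machines < 1:
        raise ValueError("need at least one machine")
    free_at = [0.0] * machines          # time at which each machine is idle
    heapq.heapify(free_at)
    for t in job_times:
        start = heapq.heappop(free_at)  # earliest available machine
        heapq.heappush(free_at, start + t)
    return max(free_at)

compiles = [60.0] * 100                 # hypothetical: 100 compiles, 60 s each
serial = makespan(compiles, 1)
parallel = makespan(compiles, 100)
print(serial / parallel)                # 100.0
```

Because no job waits on any other, the speedup here equals the machine count;
that is what makes this level the easy one.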

The next level has multiple co-operating tasks, as in a producer/consumer
relationship and similar approaches.  At this level the overhead is built
directly into the application.  Sometimes the overhead required to distribute
sufficient information to run multiple parallel tasks cancels any benefits
that might have accrued.  Keller's Readyflow system operates at this level.
The Cray X-MP, Sequent, Alliant, and some other shared memory machines can be
operated at this level using some form of microtasking.  All distributed memory
machines (such as the Intel Hypercube and NCube's machine) are operated at this
level.  (By the way, "large grain dataflow" is at this level.)
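
A producer/consumer pair can be sketched in a few lines of Python.  This is
only an illustration of the pattern, not how any of the machines named above
are programmed; the names `run_pipeline`, `buffer_size`, and `DONE` are
hypothetical.  Note that the hand-off through the queue is written into the
application itself, which is the overhead this level carries.

```python
import queue
import threading

def run_pipeline(items, buffer_size=4):
    """Producer feeds items through a bounded buffer; consumer squares them."""
    buf = queue.Queue(maxsize=buffer_size)
    results = []
    DONE = object()                 # sentinel marking the end of the stream

    def producer():
        for x in items:
            buf.put(x)              # blocks while the buffer is full
        buf.put(DONE)

    def consumer():
        while True:
            x = buf.get()           # blocks while the buffer is empty
            if x is DONE:
                break
            results.append(x * x)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_pipeline(range(5)))       # [0, 1, 4, 9, 16]
```

If the work per item is small compared with the cost of the `put`/`get`
hand-off, the two tasks run slower together than one task would alone.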

Another level is the instruction level.  At this level, instructions are
scheduled independently and in parallel.  The MIT dataflow machine is an
example of this.  Operands accumulate in a "waiting-matching store" until
all operands required by an operator have accumulated.  At that time the
operator and its operands are placed in a queue, and executed as soon as a
processor becomes available.  The Manchester dataflow machine works very
similarly to the MIT machine.  The Cray machines also take advantage of this
level of parallelism.
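
The firing rule described above can be mimicked with a toy simulation in
Python.  This is a sketch under simplifying assumptions, not a model of the MIT
or Manchester hardware: there is one simulated processor, no token tagging, and
the graph format and the name `run_dataflow` are invented for the example.

```python
from collections import deque
import operator

def run_dataflow(nodes, initial_tokens):
    """nodes: name -> (function, arity, [(dest_node, dest_port), ...]).
    initial_tokens: list of (node, port, value)."""
    waiting = {}                    # waiting-matching store: node -> {port: value}
    ready = deque()                 # operators whose operands have all arrived
    tokens = deque(initial_tokens)
    outputs = {}
    while tokens or ready:
        # Deliver tokens; an operator becomes ready once every port is filled.
        while tokens:
            node, port, value = tokens.popleft()
            slots = waiting.setdefault(node, {})
            slots[port] = value
            if len(slots) == nodes[node][1]:
                ready.append((node, waiting.pop(node)))
        # A free "processor" takes one ready operator and executes it.
        if ready:
            node, slots = ready.popleft()
            fn, arity, dests = nodes[node]
            result = fn(*(slots[p] for p in range(arity)))
            if dests:
                for dest, port in dests:
                    tokens.append((dest, port, result))
            else:
                outputs[node] = result
    return outputs

# (1 + 2) * (7 - 3): "add" and "sub" are both ready before either executes.
graph = {
    "add": (operator.add, 2, [("mul", 0)]),
    "sub": (operator.sub, 2, [("mul", 1)]),
    "mul": (operator.mul, 2, []),
}
tokens = [("add", 0, 1), ("add", 1, 2), ("sub", 0, 7), ("sub", 1, 3)]
print(run_dataflow(graph, tokens))  # {'mul': 12}
```

With more than one processor, everything sitting in the ready queue at the same
moment could execute simultaneously; here that is "add" and "sub".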

Perhaps the bottom level is the microcode level.  Any machine (such as the
DEC 8600 series) which pipelines its microcode is executing in parallel.
The Goodyear MPP is the machine which offers the most parallelism at this
level (although it's not exactly pipelined).  It weighs in at 16K 1-bit
processors.

	In article <astroatc.337> johnw@astroatc.UUCP (John F. Wardale) writes:

	This may be true, but for most "real" problems, some well-known
	person determined that the average code spends 90% of its time
	executing 10% of its code.

	This and other related studies show that a large fraction of
	problems have no, or very limited, parallelism.

As you have stated the study, it does *not* support your conclusion.  Tight
loops in FORTRAN (shudder) programs may often be parallelized by pipelining
instructions.  Kuck's work on parallelizing compilers has shown that amazing
improvements can be gained by pipelining and vectorizing DO loops.  (Yes,
both are a form of parallelism.)  It is the data dependencies which determine
the available parallelism, not the size of the code.
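
The point about data dependencies can be shown with two small loops, written
here in Python rather than FORTRAN.  The function names are hypothetical.  Each
loop is run with its iterations in forward and in reverse order: if the order
does not matter, the iterations are independent and could run in parallel.

```python
def independent(b, c, order):
    """a[i] = b[i] + c[i]: no iteration reads what another one writes."""
    a = [0] * len(b)
    for i in order:
        a[i] = b[i] + c[i]
    return a

def carried(b, order):
    """a[i] = a[i-1] + b[i]: each iteration needs the previous result."""
    a = [0] * len(b)
    for i in order:
        a[i] = (a[i - 1] if i > 0 else 0) + b[i]
    return a

b = [1, 2, 3, 4]
c = [10, 20, 30, 40]
fwd = list(range(4))
rev = list(reversed(fwd))

print(independent(b, c, fwd) == independent(b, c, rev))  # True
print(carried(b, fwd) == carried(b, rev))                # False
```

Both loops are equally short and equally "hot"; only the first can be
vectorized as written.  The second is a running sum and needs to be
restructured before its iterations can overlap.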

	A couple months ago there was some discussion of somebody's
	challenge (with a moderate cash prize) .... As I recall, you had
	to speed up a general problem (not limited to HIS problem set, but
	he could reject anything that was "embarrassingly parallel" [like
	the 100 compiles example]) by a factor of 100, and you could use
	as many processors as you wanted.   Did anyone save any of these?
	Has anyone won the prize yet?

				John W
	- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
	Name:	John F. Wardale
	UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
	arpa:   astroatc!johnw@rsch.wisc.edu
	snail:	5800 Cottage Gr. Rd. ;;; Madison WI 53716
	audio:	608-221-9001 eXt 110

This seems pretty easy to me.  Any fluid modeling or simulation problem such
as numerical weather forecasting, dynamic air-flow analysis, oceanographic
simulation, planet/galaxy formation, gas dispersion, etc., would benefit
a lot from just about any level of parallelism.  If this problem set isn't
sufficiently "real", how about finite-element analysis, or image processing?
I would rather solve any of these problems on the MPP than any single 8-bit
processor.
--
Doug Pase   --   ...ucbvax!tektronix!ogcvax!pase  or  pase@Oregon-Grad (CSNet)