Path: utzoo!attcan!uunet!wuarchive!brutus.cs.uiuc.edu!usc!elroy.jpl.nasa.gov!ames!sgi!bron@bronze.wpd.sgi.com
From: bron@bronze.wpd.sgi.com (Bron Campbell Nelson)
Newsgroups: comp.sys.sgi
Subject: Re: Multi-processor problems
Summary: Here's what's going on ...
Message-ID: <48174@sgi.sgi.com>
Date: 12 Jan 90 19:09:55 GMT
References: <9001120157.AA15338@smithkline.com>
Sender: bron@bronze.wpd.sgi.com
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 65

In article <9001120157.AA15338@smithkline.com>, dixons%phvax.dnet@SMITHKLINE.COM writes:
> I have been working on getting a FORTRAN program running parallel.  I seem
> to have gotten it running with reasonable load balance, etc but have
> observed a curious phenomenon which depends on the system load.  Here's
> what happens:
[description deleted]
> In other words, using four processors suddenly takes 3 times longer than
> 1 processor.  This seems to be repeatable.  Also if two other computer
> bound jobs are each using a processor then the problem starts when
> three processors are used for the mp job.  
[more stuff deleted]
> Scott Dixon (dixons@smithklin.com)

The brief answer is: yes, there is a problem here, and the tools needed
to overcome it will be in the next major release (3.3 or whatever we
wind up calling it).

The considerably longer answer goes like this:  

The first (i.e. current) release of SGI's parallel Fortran supports only a
single model of parallel execution: equal numbers of iterations of a
DO loop are assigned to each process.  When a parallel loop is entered, the
work is parceled out.  When a process finishes its piece of the parallel
loop, it waits at the bottom of the loop until all the other processes
finish their pieces (i.e. we do a barrier synchronization at the bottom
of each loop).
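
To make the static model concrete, here is a minimal sketch in Python.  It
is not SGI's runtime or anything the Fortran compiler generates; the names
static_chunks and run_static are made up for illustration, and threads stand
in for the processes of the parallel job.  Each worker gets an equal
contiguous block of iterations and then waits at a barrier at the bottom of
the loop.

```python
import threading

def static_chunks(n_iters, n_procs):
    """Split iterations 0..n_iters-1 into n_procs contiguous, near-equal blocks."""
    base, extra = divmod(n_iters, n_procs)
    chunks = []
    start = 0
    for p in range(n_procs):
        # The first `extra` workers take one additional iteration.
        size = base + (1 if p < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

def run_static(n_iters, n_procs, body):
    """Run body(i) for every iteration, with the work parceled out up front."""
    barrier = threading.Barrier(n_procs)

    def worker(chunk):
        for i in chunk:
            body(i)
        # Barrier at the bottom of the loop: nobody proceeds until all arrive.
        barrier.wait()

    threads = [threading.Thread(target=worker, args=(chunk,))
               for chunk in static_chunks(n_iters, n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Note that the barrier is what makes the loop only as fast as its slowest
worker: a worker that finishes early has nothing to do but wait.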

What happens in the case Scott describes is that a parallel loop is entered,
and iterations are assigned to all 4 processes of the parallel job.
Unfortunately, the fourth process cannot run since there is already another
compute bound process running on the fourth cpu.  The other 3 processes
finish their pieces, and then wait for the fourth process.  However, they
must typically wait a very long time since the fourth process has to wait
for some other process's time slice to expire, and then do a task switch.
All in all, a very messy business.
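
A back-of-the-envelope model of why this hurts, again only a sketch: with a
barrier at the bottom, the loop takes as long as the last process to arrive.
The function name static_loop_time and all the numbers are invented for
illustration; they are not measurements from Scott's program.

```python
def static_loop_time(work_per_proc, start_delays):
    """Elapsed time of one statically scheduled parallel loop.

    work_per_proc: compute time of each process's share (equal shares).
    start_delays:  how long each process waits before it gets a cpu.
    The barrier means the loop ends when the slowest process finishes.
    """
    return max(delay + work_per_proc for delay in start_delays)

# Hypothetical numbers: 40 units of total work, split 4 ways.
undisturbed = static_loop_time(10, [0, 0, 0, 0])    # all 4 cpus free
contended = static_loop_time(10, [0, 0, 0, 100])    # 4th process waits for a cpu
serial = 40                                         # same work on 1 cpu
```

With these made-up figures the undisturbed loop takes 10 units and the
contended one takes 110, which is worse than the 40 units of the serial run.
The wait for a cpu, not the computation, dominates.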

This problem happens because the parallel job wants all 4 cpus in order to
run efficiently, but it can't get all 4 cpus because other jobs are running.
Admittedly, this is hardly surprising; it's a rare person who gets a whole
4D/240 dedicated to their personal use!

Right now, what you can do is restrict the number of cpus that a job asks
for.  Instead of trying to use all the cpus, only use half (or whatever).
In the next release, there will be 2 new enhancements that will help
cure this problem:  First, the process scheduler has been enhanced to
support "gang" scheduling.  In this mode, the parallel job will have all
of its processes scheduled as a unit (i.e. "all or nothing").  This avoids
the "wait for a process to be scheduled" problem described above.  Second,
we support dynamic assignment of loop iterations to processes, so rather
than assigning a fixed share of the iterations to each process up front, the
next iteration gets assigned to the next available process.  This allows
parallel loops to complete even if some processes of the parallel job never
get to run.  This is more flexible, but since the parcelling out of
iterations must now be controlled with a critical section, the overhead is
higher.
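
For illustration, dynamic assignment can be sketched in Python along the
following lines.  This is an assumption about the general technique, not the
implementation in the upcoming release; run_dynamic is a made-up name and
threads again stand in for processes.  A shared counter holds the next
unassigned iteration, and a lock plays the role of the critical section.

```python
import threading

def run_dynamic(n_iters, n_workers, body):
    """Run body(i) for every iteration, handing out one iteration at a time."""
    next_iter = 0
    lock = threading.Lock()

    def worker():
        nonlocal next_iter
        while True:
            # Critical section: claim the next unassigned iteration.
            with lock:
                i = next_iter
                next_iter += 1
            if i >= n_iters:
                return          # nothing left to hand out
            body(i)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Every iteration costs a trip through the lock, which is the extra overhead
mentioned above.  In exchange, the loop finishes as long as at least one
worker gets to run, since whichever workers are running keep claiming
iterations until none remain.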

Personally, I suspect that the best way to run will be to gang schedule *and*
use only 3 cpus.  That way you won't get the whole job kicked out just
because one other process wants to run.

Hope this helps.

--
Bron Campbell Nelson
bron@sgi.com  or possibly  ..!ames!sgi!bron
These statements are my own, not those of Silicon Graphics.