Path: utzoo!attcan!uunet!wuarchive!brutus.cs.uiuc.edu!usc!elroy.jpl.nasa.gov!ames!sgi!bron@bronze.wpd.sgi.com
From: bron@bronze.wpd.sgi.com (Bron Campbell Nelson)
Newsgroups: comp.sys.sgi
Subject: Re: Multi-processor problems
Summary: Here's what's going on ...
Message-ID: <48174@sgi.sgi.com>
Date: 12 Jan 90 19:09:55 GMT
References: <9001120157.AA15338@smithkline.com>
Sender: bron@bronze.wpd.sgi.com
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 65

In article <9001120157.AA15338@smithkline.com>, dixons%phvax.dnet@SMITHKLINE.COM writes:
> I have been working on getting a FORTRAN program running parallel. I seem
> to have gotten it running with reasonable load balance, etc but have
> observed a curious phenomenon which depends on the system load. Here's
> what happens:
[description deleted]
> In other words, using four processors suddenly takes 3 times longer than
> 1 processor. This seems to be repeatable. Also if two other computer
> bound jobs are each using a processor then the problem starts when
> three processors are used for the mp job.
[more stuff deleted]
> Scott Dixon (dixons@smithklin.com)

The brief answer is: yes, there is a problem here, and the tools needed
to overcome it will be in the next major release (3.3 or whatever we
wind up calling it).

The considerably longer answer goes like this:

The first (i.e. current) release of SGI's parallel Fortran supports only
a single model of parallel execution: equal numbers of iterations of a
DO loop are assigned to each process. When a parallel loop is entered,
the work is parceled out. When a process finishes its piece of the
parallel loop, it waits at the bottom of the loop until all the other
processes finish their pieces (i.e. we do a barrier synchronization at
the bottom of each loop).

What happens in the case Scott describes is that a parallel loop is
entered, and iterations are assigned to all 4 processes of the parallel
job.
Unfortunately, the fourth process cannot run since there is already
another compute-bound process running on the fourth cpu. The other 3
processes finish their pieces, and then wait at the barrier for the
fourth process. However, they must typically wait a very long time,
since the fourth process has to wait for some other process's time slice
to expire, and then do a task switch. All in all, a very messy business.

This problem happens because the parallel job wants all 4 cpus in order
to run efficiently, but it can't get all 4 cpus because other jobs are
running. Admittedly, this is hardly surprising; it's a rare person who
gets a whole 4D/240 dedicated to their personal use!

Right now, what you can do is restrict the number of cpus that a job
asks for. Instead of trying to use all the cpus, only use half (or
whatever).

In the next release, there will be 2 enhancements that will help cure
this problem:

First, the process scheduler has been enhanced to support "gang"
scheduling. In this mode, the parallel job will have all of its
processes scheduled as a unit (i.e. "all or nothing"). This avoids the
"wait for a process to be scheduled" problem described above.

Second, we support dynamic assignment of loop iterations to processes:
rather than giving every process a fixed share of the iterations up
front, the next iteration gets assigned to the next available process.
This allows parallel loops to complete even if some processes of the
parallel job never get to run. This is more flexible, but since the
parcelling out of iterations must now be controlled with a critical
section, the overhead is higher.

Personally, I suspect that the best way to run will be to gang schedule
*and* use only 3 cpus. That way you won't get the whole job kicked out
just because one other process wants to run.

Hope this helps.

--
Bron Campbell Nelson
bron@sgi.com or possibly ..!ames!sgi!bron
These statements are my own, not those of Silicon Graphics.
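
Illustrative aside: the difference between the two ways of assigning
iterations described above (equal shares up front with a barrier at the
bottom of the loop, versus handing the next iteration to the next
available process) can be sketched with a toy timing model. This is
only a rough sketch in Python, not SGI's implementation. The function
names, the unit iteration cost, the 2000-unit delay for the fourth
process, and the per-iteration grab_cost standing in for the critical
section are all assumptions made up for the illustration. The model
also charges grab_cost to each process independently rather than
serializing processes through a real lock.

```python
import heapq


def static_finish_time(n_iters, start_times, iter_cost=1.0):
    """Time at which the barrier at the bottom of the loop is passed when
    iterations are split into (nearly) equal shares up front.

    start_times[k] is the (hypothetical) time at which process k first
    gets a cpu; a large value models a process stuck behind another job.
    """
    n_procs = len(start_times)
    base, extra = divmod(n_iters, n_procs)
    finish = 0.0
    for rank, start in enumerate(start_times):
        share = base + (1 if rank < extra else 0)
        if share:
            # The barrier cannot be passed until the slowest share is done.
            finish = max(finish, start + share * iter_cost)
    return finish


def dynamic_finish_time(n_iters, start_times, iter_cost=1.0, grab_cost=0.0):
    """Finish time when each iteration goes to whichever process is free
    soonest. grab_cost is an assumed per-iteration overhead for fetching
    the next iteration number inside a critical section.
    """
    free_at = list(start_times)
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_iters):
        t = heapq.heappop(free_at)          # next available process
        done = t + grab_cost + iter_cost    # it grabs and runs one iteration
        finish = max(finish, done)
        heapq.heappush(free_at, done)
    return finish


if __name__ == "__main__":
    busy = [0, 0, 0, 2000]   # fourth process delayed by another job
    idle = [0, 0, 0, 0]      # whole machine available
    for label, starts in (("busy", busy), ("idle", idle)):
        print(label,
              "static:", static_finish_time(1200, starts),
              "dynamic:", dynamic_finish_time(1200, starts, grab_cost=0.1))
```

In this model a delayed process holds up the whole static loop, since
its share cannot start until it gets a cpu, while the dynamic loop
finishes on the three processes that are actually running. On an
otherwise idle machine the nonzero grab_cost makes the dynamic version
somewhat slower than the static one, which is the overhead tradeoff
mentioned above.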