Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!purdue!gatech!hubcap!hammonds
From: hammonds@riacs.edu (Steve Hammond)
Newsgroups: comp.parallel
Subject: Re: Opinions on Debugging Parallel Programs
Message-ID: <4821@hubcap.UUCP>
Date: 17 Mar 89 13:04:29 GMT
Sender: fpst@hubcap.UUCP
Lines: 94
Approved: parallel@hubcap.clemson.edu

>One of the hot topics of research today is how to debug parallel
>programs, both on shared memory multiprocessors and on distributed
>memory machines.  Often it seems these debugger systems are developed
>more for ease of implementation rather than for providing maximum
>utility and ease of use.  What I'd like to do is get some opinions on
>just what sort of features a good debugger system for parallel programs
>should provide.
> 
> 
>The kind of information I'm looking for includes:
> 
>Is it harder to write and debug new parallel programs, or to parallelize
>"dusty deck" serial programs?

It is not clear to me how you measure "harder".  Do you mean harder
to get *something* running or harder to squeeze that last cpu second
out of the code?  I have written parallel algorithms from scratch
and parallelized sequential codes (not really dusty deck stuff, since
it was pretty well written Fortran code and the application was
amenable to ||'ism).  If good "software engineering" techniques
are applied then neither is very hard to program, i.e., to get working
code.  To me, the hard part is the thought that goes into problem
partitioning and algorithm design before your hands ever touch the
keyboard.

The best way that I have found to develop code (|| or sequential)
is to get a kernel working and then incrementally add pieces to
it until you have worked up a running system.  I have just
finished coding an iterative solver for large sparse linear
systems (arising from discretized PDEs) on a Sequent Balance 21000.
Now I am moving to the Connection Machine.

>What are the most common bugs you have encountered during parallel
>programming development and production runs (e.g., unintentional change
>to a shared variable, etc.).

The most common bug that I have run into is a synchronization
problem (on the Sequent, an MIMD machine): one process modifies
a shared variable before it should.  It is difficult to explain
in just a few lines so I will leave it at that.
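
As a rough illustration of this class of bug (a Python sketch, not
the Sequent code; the name jacobi_step, the averaging update and the
use_barrier flag are all made up for the example): each worker must
read its neighbors' old values before anyone overwrites them.  With
the barrier the result matches a sequential sweep; with
use_barrier=False a worker may write "before it should" and the
result can depend on timing.

```python
import threading

def jacobi_step(values, n_workers=4, use_barrier=True):
    """One sweep: every element becomes the average of its two OLD neighbors."""
    n = len(values)
    shared = list(values)                    # shared array, updated in place
    barrier = threading.Barrier(n_workers)

    def worker(rank):
        mine = range(rank, n, n_workers)     # elements owned by this worker
        # Read phase: compute new values from the neighbors' current contents.
        new = {i: (shared[(i - 1) % n] + shared[(i + 1) % n]) / 2 for i in mine}
        if use_barrier:
            barrier.wait()                   # nobody writes until all have read
        # Write phase: without the barrier this can clobber a value that
        # another worker has not read yet.
        for i, v in new.items():
            shared[i] = v

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```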

>What methods have you used in your attempts to debug parallel programs?
>Of these, which were most successful?

On the Sequent, I used pdbx.  It really wasn't that helpful,
because most of the errors I tried to find were due to timing.
Often I could not find the error at all, since breakpoints
set at the end of procedures synchronized the code and made it
run differently than under normal operating conditions.
Mostly I just started littering my code with barriers until
the problem went away and then I would start removing them
until the problem surfaced again.  That pointed to the timing problem,
which usually resulted from problem partitioning, etc.
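
The add-barriers-then-remove-them procedure can be sketched roughly
as follows (Python, purely illustrative; SyncPoints, run and the
sync point names are hypothetical, not anything from pdbx or the
Sequent libraries).  Each barrier gets a name so that it can be
switched off individually between runs.

```python
import threading

class SyncPoints:
    """Named barriers that can be enabled or disabled one at a time."""
    def __init__(self, n_workers, enabled):
        self.enabled = set(enabled)
        self._barriers = {name: threading.Barrier(n_workers)
                          for name in self.enabled}

    def wait(self, name):
        # A disabled sync point is a no-op, so the code runs unsynchronized there.
        if name in self.enabled:
            self._barriers[name].wait()

def run(n_workers, enabled):
    """Run a two-phase toy computation and return the observed event log."""
    sync = SyncPoints(n_workers, enabled)
    log, lock = [], threading.Lock()

    def worker(rank):
        with lock:
            log.append(("phase1", rank))
        sync.wait("after_phase1")
        with lock:
            log.append(("phase2", rank))
        sync.wait("after_phase2")

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Starting with every sync point enabled and dropping names from the
enabled set one run at a time narrows down which phase boundary
actually needs the synchronization.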

>What types of tools do you think would be helpful in developing and
>debugging parallel programs?  (For example, would it be helpful to
>observe sequential execution within each process executing in parallel?)

I think a useful tool would be something that captured the order
of "events" so that an MIMD program has a repeatable order of
execution.  When I am debugging I want a deterministic sequence
of events.  For example, I want processes to finish tasks in the same order.
I believe something like this was being worked on at U. Rochester.
I think that one of the people involved was Tom LeBlanc if you
want to check into it.  It was being developed on their 128-node Butterfly.
Anyone know the status of this?
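
To make the idea concrete, here is a minimal record-and-replay
sketch in Python.  It is not the Rochester system, only an
illustration under simple assumptions: the only "event" is a worker
finishing a task, and a replayed run performs exactly the same
number of events per worker as the recorded one.  EventOrder and
run are hypothetical names.

```python
import threading

class EventOrder:
    """Record the order of events in one run; enforce it in later runs."""
    def __init__(self, replay=None):
        self._cond = threading.Condition()
        self._replay = list(replay) if replay is not None else None
        self._next = 0      # index of the next event allowed during replay
        self.trace = []     # order in which events actually happened

    def event(self, worker_id):
        with self._cond:
            if self._replay is not None:
                # Block until the recorded order says it is this worker's turn.
                self._cond.wait_for(
                    lambda: self._replay[self._next] == worker_id)
                self._next += 1
                self._cond.notify_all()
            self.trace.append(worker_id)

def run(n_workers, tasks_per_worker, order=None):
    """Run the workers; if `order` is given, force that event order."""
    rec = EventOrder(replay=order)

    def worker(rank):
        for _ in range(tasks_per_worker):
            # ... the real work for one task would go here ...
            rec.event(rank)             # "task finished" event

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rec.trace
```

A first call such as run(3, 2) records whatever order happened to
occur; passing that trace back in as order= makes the next run
repeat it, which is the deterministic sequence wanted for debugging.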

>Since many people now include performance evaluation and improvement as
>part of the debugging process when dealing with parallel programs, what
>type of information would be useful in this area?

Perhaps for a shared memory machine one would be interested in
bus contention or hot spots in memory.

>If you have used any existing parallel debugger systems, either
>commercial or experimental, could you name them and give me some
>feedback on their usefulness?

I have used pdbx.  It is not truly useful.  It does give one
the capability to stop all processes and let them execute
one at a time.  Basically, it was just dbx running with
multiple processes.

>Sue Utter
>Technology Integration Group
>Cornell National Supercomputer Facility


    Steve


-- 

 Steve Hammond  * Parallel Systems Division * RIACS * NASA Ames Research Center