Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!purdue!gatech!hubcap!hammonds
From: hammonds@riacs.edu (Steve Hammond)
Newsgroups: comp.parallel
Subject: Re: Opinions on Debugging Parallel Programs
Message-ID: <4821@hubcap.UUCP>
Date: 17 Mar 89 13:04:29 GMT
Sender: fpst@hubcap.UUCP
Lines: 94
Approved: parallel@hubcap.clemson.edu

>One of the hot topics of research today is how to debug parallel
>programs, both on shared memory multiprocessors and on distributed
>memory machines. Often it seems these debugger systems are developed
>more for ease of implementation rather than for providing maximum
>utility and ease of use. What I'd like to do is get some opinions on
>just what sort of features a good debugger system for parallel programs
>should provide.
>
>
>The kind of information I'm looking for includes:
>
>Is it harder to write and debug new parallel programs, or to parallelize
>"dusty deck" serial programs?

It is not clear to me how you measure "harder". Do you mean harder to
get *something* running, or harder to squeeze that last CPU second out
of the code? I have written parallel algorithms from scratch and
parallelized sequential codes (not really dusty deck stuff, since it was
pretty well written Fortran code and the application was amenable to
parallelism). If good "software engineering" techniques are applied,
then neither is very hard to program, i.e., to get working code. To me,
the hard part is the thought that goes into problem partitioning and
algorithm design before your hands ever touch the keyboard. The best way
that I have found to get working code (parallel and sequential) is to
get a kernel working and then incrementally add pieces to it until you
have worked up a running system. I have just finished coding an
iterative solver for large sparse linear systems (arising from
discretized PDEs) on a Sequent Balance 21000. Now I am moving to the
Connection Machine.

>What are the most common bugs you have encountered during parallel
>programming development and production runs (e.g., unintentional change
>to a shared variable, etc.).

The most common bug that I have run into is a synchronization problem
(on the Sequent, an MIMD machine): one process modifying a shared
variable before it should. It is difficult to explain in just a few
lines, so I will leave it at that.

>What methods have you used in your attempts to debug parallel programs?
>Of these, which were most successful?

On the Sequent, I used pdbx. It really wasn't that helpful, because most
of the errors I tried to find were due to timing. Often I could not find
the error, since breakpoints set at the end of procedures synchronized
the code and made it run differently than under normal operating
conditions. Mostly I just started littering my code with barriers until
the problem went away, and then I would start removing them until the
problem surfaced again. That pointed to the timing problem, which
usually resulted from problem partitioning, etc.

>What types of tools do you think would be helpful in developing and
>debugging parallel programs? (For example, would it be helpful to
>observe sequential execution within each process executing in parallel?)

I think a useful tool would be something that captured the order of
"events" so that an MIMD program could be given a repeatable order of
execution. When I am debugging, I want a deterministic sequence of
events. For example, I want processes to finish tasks in the same order
on every run. I believe something like this was being worked on at
U. Rochester. I think that one of the people involved was Tom LeBlanc,
if you want to check into it. It was being developed on their 128-node
Butterfly. Anyone know the status of this?

>Since many people now include performance evaluation and improvement as
>part of the debugging process when dealing with parallel programs, what
>type of information would be useful in this area?

Perhaps for a shared memory machine one would be interested in bus
contention or hot spots in memory.

>If you have used any existing parallel debugger systems, either
>commercial or experimental, could you name them and give me some
>feedback on their usefulness?

I have used pdbx. It is not truly useful. It does give one the
capability to stop all processes and let them execute one at a time.
Basically, it was just dbx running with multiple processes.

>Sue Utter
>Technology Integration Group
>Cornell National Supercomputer Facility

Steve
-- 
Steve Hammond * Parallel Systems Division * RIACS * NASA Ames Research Center
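
To make the "capture the order of events" idea above a little more
concrete, here is a minimal sketch in Python. It is not the Rochester
system and has nothing to do with pdbx; the names OrderLog and
run_workers are hypothetical, made up for this illustration. It assumes
each worker passes the logged sync point exactly once, and that a
replayed order lists every worker id exactly once. In record mode the
log notes which worker got to the shared data first; in replay mode each
worker waits until the recorded order says it is its turn, so a second
run repeats the interleaving of the first.

```python
import threading


class OrderLog:
    """Record the order in which workers pass a sync point, or replay one.

    Hypothetical helper for illustration only; not part of pdbx or any
    real debugger.
    """

    def __init__(self, replay=None):
        self._cond = threading.Condition()
        self.order = []  # worker ids, in the order their events ran
        self._replay = list(replay) if replay is not None else None

    def run(self, tid, action):
        """Run action() as one logged event on behalf of worker tid."""
        with self._cond:
            if self._replay is not None:
                # Replay mode: block until the recorded order says it is
                # this worker's turn.
                self._cond.wait_for(
                    lambda: self._replay[len(self.order)] == tid
                )
            action()
            self.order.append(tid)
            self._cond.notify_all()


def run_workers(log, n):
    """Start n workers that each append their id to a shared list once."""
    shared = []

    def worker(tid):
        log.run(tid, lambda: shared.append(tid))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


if __name__ == "__main__":
    recorded = OrderLog()
    first = run_workers(recorded, 4)  # order depends on scheduling
    second = run_workers(OrderLog(replay=recorded.order), 4)
    print(first == second)
```

The catch is that the log itself serializes the events it covers, so
like a breakpoint it can perturb the timing it is trying to capture. A
real tool would have to keep the recording overhead small enough that
the original bug still shows up.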