Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!ucsd!sdcsvax!trantor.harris-atd.com!melmac!chuck From: chuck@melmac.harris-atd.com (Chuck Musciano) Newsgroups: comp.arch Subject: Re: Speed Message-ID: <2023@trantor.harris-atd.com> Date: 4 May 89 14:18:58 GMT References: <13089@paris.ics.uci.edu> <150@dg.dg.com> <2187@wpi.wpi.edu> Sender: news@trantor.harris-atd.com Reply-To: chuck@trantor.harris-atd.com (Chuck Musciano) Distribution: comp Organization: Advanced Technology Dept., Harris Corp., Melbourne, Fl. Lines: 68 In article <2187@wpi.wpi.edu> lfoard@wpi.wpi.edu (Lawrence C Foard) writes: >Why isn't parallel computing used more? I can't think of many problems that >couldn't be done in parallel if the hardware/software to do it was done well. Well, having been a player in the quest for working parallel machines, here are some of the problems encountered in building large scale parallel machines: How do you program it? Parallelize existing code (the dusty deck problem), retrofit parallel constructs onto existing langauges, or design entirely new parallel programming languages? Also nontrivial: how do you change the mindset of thousands of programmers well-versed in serial programming techniques? For shared memory machines: what interconnection scheme do you use? Flat memory space? Heirarchical? What about caches to reduce long memory latencies? Cache coherency for thousands of processors? For message passing architectures: what interconnection scheme do you use? How do you handle messages in general? What sort of communication synchronization should you use? What sort of task granularity are you trying to support? Fine grained tasks of just a few instructions each? Coarse grained tasks of many thousand instructions? If you choose fine grained, how will you reduce the task overhead to make the tasks effective? If coarse grained is the way to go, how do you solve the load balancing problem when just a few processors have all the work, starving the remaining processors? Will you use static or dynamic scheduling? Static makes the task overhead smaller, encouraging granularity reduction, but it is very difficult to predict at compile-time exactly how big a task will be, so that you can correctly place tasks in the machine. Dynamic scheduling solves the load balancing problem at runtime, but increases task overhead to move tasks through the system. In addition, runtime heuristics for data distribution throughout large networks of machines are difficult to formulate. How do you debug your programs? What sort of mental model do you give your application programmers who must deal with hundreds of processors accessing both global and local data objects simul- taneously? How do you implement debugging primitives so that you can easily single step any, all, or just one processor? How can the debugger access all the data in all the processors, including the caches and local buffers? How do you even stop an entire machine all at once, given certain propogation delays throughout the architecture? How do you debug programs which are completely nondeterminisitic, with execution ordering being essentially random? How about your programming model? Communicating sequential processes? Control flow with barrier synchronization? Data flow? If so, static or dynamic? Graph reduction? How much of the underlying machine model do you promote into the programming languages? I could go on and on. Your instincts are right: there aren't many things which wouldn't run faster on a parallel machine. Unfortunately, what seems like a straightforward solution is in reality a very hard problem. If you are really interested, you might want to browse through past proceedings of the International Conference on Parallel Processing (Penn State Press) or some of the architecture and supercomputing conference proceedings. Chuck Musciano ARPA : chuck@trantor.harris-atd.com Harris Corporation Usenet: ...!uunet!x102a!trantor!chuck PO Box 37, MS 3A/1912 AT&T : (407) 727-6131 Melbourne, FL 32902 FAX : (407) 727-{5118,5227,4004}