Path: utzoo!mnetor!uunet!husc6!necntc!encore!fay From: fay@encore.UUCP (Peter Fay) Newsgroups: comp.arch Subject: Re: Single tasking the wave of the future? Message-ID: <2379@encore.UUCP> Date: 21 Dec 87 16:07:15 GMT References: <201@PT.CS.CMU.EDU> <388@sdcjove.CAM.UNISYS.COM> <988@edge.UUCP> <1227@sugar.UUCP> <151@sdeggo.UUCP> <1423@cuuxb.ATT.COM> <439@xyzzy.UUCP> <440@xyzzy.UUCP> <36083@sun.uucp> <18@amelia.nas.nasa.gov> <2341@encore.UUCP> <25@amelia.nas.nasa.gov> Reply-To: fay@encore.UUCP (Peter Fay) Organization: Encore Computer Corp, Marlboro, MA Lines: 131 To begin with, my statement apparently was found offensive: In article <25@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes: #>By the way, I don't fault people for not understanding #>about multis (e.g., from past exposure or school). It takes some #>time for common misconceptions to catch up to current reality. # #Since I don't see a smiley face on this, I will assume that it was #intended to be as obnoxious and condescending as it sounds. [...] No insult was intended. My point is that, as shown by your own example (about the multiprocessor VAX 11/782 being slower than a uniprocessor 780) older multiprocessor technology has left a bad impression of multis in general, much of which has since been solved by symmetric shared-memory multiprocessors. Mr. Fouts original statement regarding parallel processors: #[...] Parallel #processing, at least in the sense of throwing multiple processors at a #single problem is currently difficult to use. Software is hard to #write, easy to get wrong and nearly impossible to debug. As long as #this stays true, parallel processor machines will have a burden beyond #price/performance to overcome before they are accepted. My response was the follwing: #In regard to general-purpose multiprocessors: NOTE! -> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ #Every fork() (or thread_create() in Mach) in every program can get #scheduled on a different cpu (that includes every shell per user, #daemon, background task, ... Also, don't forget all those kernel #processes (or tasks, threads) running on different cpus (pagers, #signal handlers, ...). How difficult is it when the O.S. does it #transparently? # #And then there is more sophisticated mechanisms ("micro-tasking", gang #scheduling, vectorizing fortran compilers) available to any user who #wants more capability. He responded to me: #Mr. Fay is using transparently in a way with which I am unfamilar. It #is true that Mach provides primitives which allow the programmer to #introduce multitasking into the program; but these are in no sense #transparent. Task creation and deletion, synchronization, and task #scheduling all require explict code in the program which is to take #advantage of the tasking. Even the ancient Unix fork() has to be #explicitly coded for. # Yes, fork() and thread_create() must be explicitly coded. One way of looking at this though, is that, for example, Unix utilities already have about 850 fork()'s ALREADY coded in them. Compile them and new fork()'s will generally be run in parallel by the O.S. This usually is not optimal code, but it IS parallel and ISN'T difficult. Our modified "make" utility uses simple forks, shared memory and a dependency graph and often produces near-linear speedup. This parallelism enabled our machine to have the fastest Ada Validation Suit (just over three hours). No, this isn't a marketing pitch, just a justification of my remarks. (I still don't know why anyone would want Ada, anyway :-) I just compiled our 4.2 Unix kernel from scratch (152,000 lines in 13 minutes): 3461.2u 642.1s 13.26 508% 0+2k 1842+9439io 59pf+0w Using a 6-cpu machine, I got a speedup of 508% (obviously, I had to share some cpus with the O.S. and daemons). # #A problem with [micro-tasking, gang scheduling, vectorizing] is that they can #lead to parallel execution in which the wall clock time goes up as a #function of the number of processors, rather than down. . . True. It "can" also go down. As we all know, it depends on the algorithm and how it is coded. If there is a high amount of data dependency, many synchronization points, and little computation between, adding more processors may slow it down (on a supercomputer or a multi). And if that's the only algorithm you have, then a multi is not for you. If, however, you do lots of compiles, use lots of utilities, write operating system code (like me), use databases, AI, sorts, TP, editing, concurrent Ada & pascal, parallel OPS5, etc. (as I said, _general_purpose_), then a multi is the only way to go. #Another #problem is that no compiler can perfectly optimize. The better the #programmer understands what the optimizer can optimize, the easier it #is to write optimizable code. With vector code, this is fairly #straightforward to do, with parallel code, it is much more difficult. # I agree. #>Writing software which exploits the FULL parallelism of a machine MAY #>be hard to do in CERAIN cases. # #It has been my experience in five years of writing code to exploit #distributed and concurrent parallelism that it is hard in a large #number of cases. [...] There are #also algorithms requiring high communication cost, much #synchronization and unpredictable work per task which are nearly #untractable. There appear to be an entire class of algorithms for #which no parallel solution is more efficent than a sequential solution #on the same processor. I too can come up with many examples of algorithms that are hard or impossible to parallelize. Some architectures can exploit these better than others. #[...] it is probably fair to say #that little progress is being made in software either. # As I said originally: "This is one of the many time lags in parallel software tool development, not an inherent defect in architecture." >As an example of a class of algorithms which is difficult to vectorize >or parallelize, let me pull out the ancient prime finder algorithm: See my next followup - this is already too long. -- peter fay fay@multimax.arpa {allegra|compass|decvax|ihnp4|linus|necis|pur-ee|talcott}!encore!fay