Path: utzoo!mnetor!uunet!husc6!cmcl2!brl-adm!umd5!ames!necntc!linus!alliant!muller From: muller@alliant.Alliant.COM (Jim Muller) Newsgroups: comp.arch Subject: Re: Single tasking the wave of the future? Message-ID: <1030@alliant.Alliant.COM> Date: 22 Dec 87 22:59:50 GMT References: <18@amelia.nas.nasa.gov> <2341@encore.UUCP> <25@amelia.nas.nasa.gov> Reply-To: muller@alliant.UUCP (Jim Muller) Organization: Alliant Computer Systems, Littleton, MA Lines: 140 Keywords: parallel processing today In <25@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes, with some references to article <2341@encore.UUCP> fay@encore.UUCP (Peter Fay): > >...And then there is more sophisticated mechanisms ("micro-tasking", gang > >scheduling, vectorizing fortran compilers) available to any user who... >A problem with these more sophisticated mechanisms is that they can >lead to parallel execution in which the wall clock time goes up as a >function of the number of processors, rather than down... With vectorization this is typically due to excessive masking. In heavily IF'ed code, the processor(s) can end up doing much more work than necessary. The vector overhead generally is insignificant by comparison. True parallel processing, however, does not suffer this problem, since any one processor never has to execute the "not-to-be-executed" portions. For this reason, true parallel processing can reduce runtimes even on code for which vector processing is a liability. [ Example of code for which vectorization may be a liability: DO I = 1, 100 IF (NS.GT.M) THEN . . ELSE IF (N..GT.MX) THEN IF (II.EQ.IX) THEN . . ENDIF ENDIF 100 CONTINUE ] >Another problem is that no compiler can perfectly optimize. The better the >programmer understands what the optimizer can optimize, the easier it >is to write optimizable code. With vector code, this is fairly >straightforward to do, with parallel code, it is much more difficult. >It has been my experience in five years of writing code to exploit >distributed and concurrent parallelism that it is hard in a large >number of cases. It is a truism for any machine, vector- or multi- or not, that a good program will run better than a poor one, and that a programmer who is ignorant of the compiler's capabilities can accidentally defeat its best efforts. The same holds for hardware issues as well as software. As for parallelism being harder to exploit than vectorization, the reverse is more likely to hold. Parallel execution can be applied to practically any vectorizable structure, while the reverse is decidedly not true. Many structures can be run as parallel processes, with runtime speedup, even if vectorization is either impossible or a liability. Anyway, it should be the compiler's job to analyze code for parallel or vector opportunities, and compilers capable of doing that well do exist. >There are well behaved algorithms, such as those involved in PDE solution >for which parallelism is trivial. There are also algorithms requiring high >communication cost, much synchronization and unpredictable work per task >which are nearly untractable. There appear to be an entire class of >algorithms for which no parallel solution is more efficent than a sequential >solution on the same processor. Indeed parallelism is trivial in some problems. That does not make it useless. Instead, it make it easy and efficient. More to the point, however, for those problems which do not exhibit inherent parallelism, nothing requires the machine (or compiler) to use it where inappropriate. All that is needed is that the unused processors be doing something else. For problems where it *is* appropriate, it can produce significant runtime speedup. In fact, most problems exhibit "some" parallelism, usually enough to be beneficial. For large, computationally intensive problems, the gain in execution speed from parallelism will usually justify the amount of effort required to exploit it, especially if the compiler rather than the programmer does most of the work. > >Debugging is a whole other soapbox, but my experience is that debugging > >coding errors is not much more difficult than uniprocessors. What is hard > >(or "impossible" with current tools) is detecting race conditions and > >bottlenecks - i.e. CONCEPTUAL errors. Race conditions are typically a problem only when a compiler has been inappropriately "permitted" to optimize code that it normally would have refused. The obvious approach, of course, should be to debug in single- processor mode. >...Throwing a parallel machine at an average workload is like throwing a >vector machine at it, only more so: The difference between vendor "not to >be exceeded" speed for the machine and delivered throughput for the >workload can be two order of magnitude, dramatically reducing the real >price / performance. This was discussed by this group just a month or so ago. There are really two causes. The first is the structure of the code itself. Certainly, real life programs are often dominated by scalar code, and better-written code will be less likely to thwart the compiler. More importantly, "peak" speeds quoted are usually the output rate during a vector instruction, totally ignoring the vector startup time. Startup time can be significant and must be paid every time a vector instruction starts (or repeats if the loop length is greater than the vector length). When this is taken into account, the *real* maximum rates can be achieved (well, almost) by codes that are well-structured for the machine. For an "average workload", a multi-processor machine can be beneficial if the various processors are free to do other jobs when parallelism is inappropriate. >Top end clock rates are down to a few nanoseconds, and it doesn't appear >likely that subnanosecond rates are going to arrive before... Light (or electromagnetic pulses) can only travel so fast, so you have to make things smaller to make them faster (obviously). This is *exactly* why radically different approaches such as parallelism, etc. are *appropriate* courses to follow, rather than useless. It is the single processor that is becoming harder to speed up. >...it is a most difficult problem, and one which decades of parallel >processing research has not solved adequately... >When you couple this with the fact that after twenty years of various >kinds of software/hardware research, parallel processing is still >limited to the technology described by Dykstra in 63, and still has >the same problems in software development, it is probably fair to say >that little progress is being made in software either. My background is science, not computer architecture, so I do not fully appreciate the "decades of parallel processing research" or the "past twenty years of various kinds of software/hardware". But I am quite familiar with some approaches, and have seen that it does work. The Alliant FX/8 was based on Kuck (1976), and in fact, was designed to address many of the objections you raised about parallel processing's limitations. It certainly cannot be described as equivalent to anything from 20 years ago. It does have most of the features I have described above, including an intelligent compiler (you are right, the programmer should not have to explicitely deal with parallelism), parallel I/O capability, slowdown with vectorization for heavily-masked code (most vector machines behave this way), "other jobs" for free (non-busy) processors when parallelism is inappropriate, race-condition failure when the compiler is permitted to optimize inappropriately, good parallelism for some codes and none at all for others, significant vector startup time that reduces the "peak" speed to a more "real" speed, and, finally, a dependence, as with any other machine, on the programmer avoiding any of the pitfalls that will defeat the machine's best intentions. REFERENCES Kuck, D., "Parallel Processing of Ordinary Programs," Advances in Computing, Vol. 15, 119-179, M. Rubinoff and M.L. Yovits, Eds., Academic Press, 1976. ----------------------------------------------------------------------------- This is not an official Alliant response. I am simply an employee who feels differently than you do. I only mentioned the FX/8 because it is a good, real-world, functioning example that attempts to address the valid objections you raised, and because I understand it. Alliant had no input into this posting, nor did Alliant require me to write this disclaimer.