Path: utzoo!attcan!uunet!lll-winken!ames!rex!uflorida!gatech!hubcap!wagner From: wagner@june.cs.washington.edu (Bullwinkle J. Moose) Newsgroups: comp.parallel Subject: Re: Distributed simulation Summary: Statistical considerations for parallel simulation Message-ID: <5312@hubcap.clemson.edu> Date: 27 Apr 89 20:03:51 GMT Sender: fpst@hubcap.clemson.edu Lines: 142 Approved: parallel@hubcap.clemson.edu In article <5276@hubcap.clemson.edu>, matloff%mole.Berkeley.EDU@ucbvax.Berkeley.EDU (Norman Matloff) writes: > In article <5117@hubcap.clemson.edu> usha%FSU.BITNET@CUNYVM.CUNY.EDU writes: > > Here is the setting. It's discrete-event simulation, which many or most > of your references are concerned with, on multiple-processor systems > (tightly or loosely coupled). Let P be the number of processors and T > be the amount of run time needed to get the desired statistical accuracy. > ... > Moreover, it seems to me that the best method is also the simplest -- > just have all P processors run the uniprocessor version of the program, > but for T/P amount of run time, instead of T. [Of course, each processor > needs to use a different random number seed.] > > Is anyone aware of any thorough work on this question, i.e. the question > of whether the complex methods are better or worse than the simple one? > I think there was a brief paper by Heidelberger on the effect of > transients in this context a few years ago in one of the Winter > Simulation Conferences, but it didn't go very far. Well, funny you should ask, because I have just finished writing up the chapter of my Ph.D. dissertation dealing with this question. In fact, Heidelberger is the only one I know of who has done any analysis of this problem. The complete references are: @InProceedings{ Heidelberger86, Author= "P. Heidelberger", Title= "{Statistical Analysis of Parallel Simulations}", Booktitle= "Proc. 1986 Winter Simulation Conference", Year = 1986, Pages= "290-295", Organization= SCS, Abstract= "Comparison of independent replications with a single long parallel simulation." } @TechReport{ Heidelberger87, Author= "P. Heidelberger", Title= "{Discrete Event Simulations and Parallel Processing: Statistical Properties}", Institution= "IBM Thomas J. Watson Research Center", Year = 1987, Number= "RC 12733 (\#57302)", Address= "Yorktown Heights, NY", Month= May, Abstract= "Proves that certain statistical estimators obtained by running separate replications in parallel are biased, and that some even converge to the wrong value!" } Heidelberger86 examined the tradeoff between intra-replication parallelism M and inter-replication parallelism K when the total number of processors is some constant (hence P = MK). In essence, the results of Heidelberger86 are that intra-replication parallelism should be favored over inter-replication parallelism under the following circumstances: for systems with a strong initial transient; for short runs; or if a large number of processors are available. While I have very little to add to this in an analytical sense, I've spent quite some time evaluating the practical impact of his results. While it's true that running parallel independent replications has a lot of appeal, there are some mitigating considerations. The first and most obvious of these is that beyond a certain point, the marginal effect on confidence interval width of increasing the number of replications becomes insignificant. Thus, given enough processors, it still makes sense to parallelize each replication. Second, because of the presence of an initialization bias in any stochastic simulation, it is important to run each replication long enough to ensure that the confidence intervals that one obtains for some estimator actually contain the true value of what is being estimated. (I.e., an extremely large number of very short replications will probably result in very small confidence intervals that do not contain the true value!) What this means in practical terms is that if elapsed wall clock time is the overriding consideration, you may not be able to achieve accurate results unless you parallelize each replication. Third, Heidelberger87 pointed out that as the number of replications run in parallel increases, statistical considerations require stopping rules that lead to poor processor utilization near the end of the simulation run. Grossly oversimplifying, it turns out that if the completion time of individual replications is a random variable, then in order to obtain unbiased estimators of model outputs it is necessary to wait for all replications to finish. (The randomness of completion times may be a significant factor when the simulation is estimating transient behavior of a model, or when the regenerative method for obtaining confidence intervals is being used within each replication.) In fact, under certain assumptions it turns out that as the number of processors P increases, the speedup obtained tends toward P/logP rather than P. (This is similar to a result by Kruskal and Weiss regarding the completion time of a group of self-scheduling processors taking tasks off of a central task queue.) Therefore, your assertion that > This method gives 100% utilization for all processors, ... is statistically untrue. Lastly, I have taken the analysis of Heidelberger86 and tweaked it just a bit for purposes of illustration. Rather than fixing a running time and comparing the mean square error (mse) of the alternatives obtained by assigning various values to M and K, I have fixed the mse and time to completion of the sequential case (M=1, K=P) and calculated what speedup the parallel simulation would need to achieve (as a function of M) in order to better the accuracy of the purely sequential method. I call this the break-even speedup. It can be interpreted in one of two ways: if the parallel simulation is capable of exceeding the break-even speedup, then either (a) the parallel strategy can produce the same level of statistical confidence as the purely sequential strategy in less elapsed time, or (b) the parallel strategy can produce a higher level of statistical confidence than the purely sequential strategy in the same amount of elapsed time. The advantage of this analysis is that it does not require any assumptions about the speedup behavior of the parallel simulation. The general behavior of the break-even speedup is that it is almost linear in M, but the slope of the line decreases rapidly as the total number of processors P increases. For really large P (e.g., 1024) the break-even speedup is truly paltry even for large M, so that any reasonable implementation of a parallel simulation ought to be able to produce a win. (Note: I'm not suggesting that it's realistic to expect a parallel simulator to show speedups on 1024 processors (in fact, that wouldn't give you anything on which to base confidence intervals); rather, given 1024 processors, one ought to run a M=32, K=32 parallel simulation, or, if your simulation starts to wheeze and die at 32 processors, maybe only M=16, K=64.) Finally, a pragmatic consideration against running multiple independent replications in parallel is that some simulations have enormous memory requirements, thus, it actually may not be possible to run the replications simultaneously. In fact, Jefferson et al. at JPL are running into situations in which they can't even calculate speedup because they can't run 1-processor versions of the problems they are simulating (they are running on a distributed memory architecture). Dave Wagner University of Washington Comp Sci Department wagner@cs.washington.edu {cornell,harvard,ssc-vax,tektronix}!uw-beaver!wagner P.S. My complete dissertation will be available from the University of Washington sometime this summer. The title is "Conservative Parallel Discrete-Event Simulation: Principles and Practice".