Path: utzoo!attcan!uunet!lll-winken!ames!rex!uflorida!gatech!hubcap!wagner
From: wagner@june.cs.washington.edu (Bullwinkle J. Moose)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation
Summary: Statistical considerations for parallel simulation
Message-ID: <5312@hubcap.clemson.edu>
Date: 27 Apr 89 20:03:51 GMT
Sender: fpst@hubcap.clemson.edu
Lines: 142
Approved: parallel@hubcap.clemson.edu

In article <5276@hubcap.clemson.edu>, matloff%mole.Berkeley.EDU@ucbvax.Berkeley.EDU (Norman Matloff) writes:
> In article <5117@hubcap.clemson.edu> usha%FSU.BITNET@CUNYVM.CUNY.EDU writes:
> 
> Here is the setting.  It's discrete-event simulation, which many or most 
> of your references are concerned with, on multiple-processor systems
> (tightly or loosely coupled).  Let P be the number of processors and T
> be the amount of run time needed to get the desired statistical accuracy.
> ...
> Moreover, it seems to me that the best method is also the simplest  --
> just have all P processors run the uniprocessor version of the program,
> but for T/P amount of run time, instead of T.  [Of course, each processor 
> needs to use a different random number seed.]
> 
> Is anyone aware of any thorough work on this question, i.e. the question
> of whether the complex methods are better or worse than the simple one?
> I think there was a brief paper by Heidelberger on the effect of 
> transients in this context a few years ago in one of the Winter 
> Simulation Conferences, but it didn't go very far.

Well, funny you should ask, because I have just finished writing up the
chapter of my Ph.D. dissertation dealing with this question.

In fact, Heidelberger is the only one I know of who has done any analysis
of this problem.  The complete references are:

    @InProceedings{ Heidelberger86,
	     Author= "P. Heidelberger",
	     Title= "{Statistical Analysis of Parallel Simulations}",
	     Booktitle= "Proc. 1986 Winter Simulation Conference",
	     Year = 1986,
	     Pages= "290-295",
	     Organization= SCS,
	     Abstract= "Comparison of independent replications with a single
			long parallel simulation."
	    }

    @TechReport{ Heidelberger87,
	     Author= "P. Heidelberger",
	     Title= "{Discrete Event Simulations and Parallel Processing: Statistical Properties}",
	     Institution= "IBM Thomas J. Watson Research Center",
	     Year = 1987,
	     Number= "RC 12733 (\#57302)",
	     Address= "Yorktown Heights, NY",
	     Month= May,
	     Abstract= "Proves that certain statistical estimators 
			obtained by running separate replications
			in parallel are biased, and that some even converge
			to the wrong value!"
	    }

Heidelberger86 examined the tradeoff between intra-replication
parallelism M and inter-replication parallelism K when the total number
of processors is some constant (hence P = MK).  In essence, the results
of Heidelberger86 are that intra-replication parallelism should be
favored over inter-replication parallelism under the following
circumstances:  for systems with a strong initial transient;  for short
runs;  or if a large number of processors are available.  While I have
very little to add to this in an analytical sense, I've spent quite
some time evaluating the practical impact of his results.

While it's true that running parallel independent replications has a
lot of appeal, there are some mitigating considerations.  The first and
most obvious of these is that beyond a certain point, the marginal
effect on confidence interval width of increasing the number of
replications becomes insignificant.  Thus, given enough processors, it
still makes sense to parallelize each replication.

Second, because of the presence of an initialization bias in any
stochastic simulation, it is important to run each replication long
enough to ensure that the confidence intervals that one obtains for
some estimator actually contain the true value of what is being
estimated.  (I.e., an extremely large number of very short replications
will probably result in very small confidence intervals that do not
contain the true value!) What this means in practical terms is that if
elapsed wall clock time is the overriding consideration, you may not be
able to achieve accurate results unless you parallelize each replication.

Third, Heidelberger87 pointed out that as the number of replications
run in parallel increases, statistical considerations require stopping
rules that lead to poor processor utilization near the end of the
simulation run.  Grossly oversimplifying, it turns out that if the
completion time of individual replications is a random variable, then
in order to obtain unbiased estimators of model outputs it is necessary
to wait for all replications to finish.  (The randomness of completion
times may be a significant factor when the simulation is estimating
transient behavior of a model, or when the regenerative method for
obtaining confidence intervals is being used within each replication.)
In fact, under certain assumptions it turns out that as the number of
processors P increases, the speedup obtained tends toward P/logP rather
than P.  (This is similar to a result by Kruskal and Weiss regarding
the completion time of a group of self-scheduling processors taking
tasks off of a central task queue.)  Therefore, your assertion that

> This method gives 100% utilization for all processors, ...

is statistically untrue.

Lastly, I have taken the analysis of Heidelberger86 and tweaked it just
a bit for purposes of illustration.  Rather than fixing a running time
and comparing the mean square error (mse)
of the alternatives obtained by assigning various values to M and K, I
have fixed the mse and time to completion of the sequential case (M=1,
K=P) and calculated what speedup the parallel simulation would need to
achieve (as a function of M) in order to better the accuracy of the
purely sequential method.  I call this the break-even speedup.  It can
be interpreted in one of two ways:  if the parallel simulation is
capable of exceeding the break-even speedup, then either (a) the
parallel strategy can produce the same level of statistical confidence
as the purely sequential strategy in less elapsed time, or (b) the
parallel strategy can produce a higher level of statistical confidence
than the purely sequential strategy in the same amount of elapsed
time.  The advantage of this analysis is that it does not require any
assumptions about the speedup behavior of the parallel simulation.

The general behavior of the break-even speedup is that it is almost
linear in M, but the slope of the line decreases rapidly as the total
number of processors P increases.  For really large P (e.g., 1024) the
break-even speedup is truly paltry even for large M, so that any
reasonable implementation of a parallel simulation ought to be able to
produce a win.  (Note:  I'm not suggesting that it's realistic to
expect a parallel simulator to show speedups on 1024 processors (in
fact, that wouldn't give you anything on which to base confidence
intervals); rather, given 1024 processors, one ought to run a M=32,
K=32 parallel simulation, or, if your simulation starts to wheeze and
die at 32 processors, maybe only M=16, K=64.)

Finally, a pragmatic consideration against running multiple independent
replications in parallel is that some simulations have enormous memory
requirements, thus, it actually may not be possible to run the
replications simultaneously.  In fact, Jefferson et al. at JPL
are running into situations in which they can't even calculate speedup
because they can't run 1-processor versions of the problems they are
simulating (they are running on a distributed memory architecture).

		Dave Wagner
		University of Washington Comp Sci Department
		wagner@cs.washington.edu
		{cornell,harvard,ssc-vax,tektronix}!uw-beaver!wagner

P.S.  My complete dissertation will be available from the University of
Washington sometime this summer.  The title is "Conservative Parallel
Discrete-Event Simulation: Principles and Practice".