Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!rutgers!gatech!hubcap!matloff%crow.Berkeley.EDU
From: matloff%crow.Berkeley.EDU@ucbvax.Berkeley.EDU (Norman Matloff)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation
Message-ID: <5389@hubcap.clemson.edu>
Date: 3 May 89 11:53:00 GMT
Sender: fpst@hubcap.clemson.edu
Lines: 93
Approved: parallel@hubcap.clemson.edu

In article <5367@hubcap.clemson.edu> wagner@june.cs.washington.edu (Bullwinkle J. Moose) writes:
>In article <5319@hubcap.clemson.edu>, matloff@iris.ucdavis.edu (Norm Matloff) writes:

[I replied to Dave's article this morning, but apparently it never got
to the net, so I'm replying again.  The version here has been updated.]

*First of all, note that the steady-state of a stochastic process is
*represented not by a single value, but by a *distribution* of that value.

True, but if our aim is to reduce initialization bias, a single value
should work reasonably well.  See below.

*     [ In the following paragraph, Di is defined to be the delay experienced
*     by the ith customer to arrive at the server, and d is defined to be the
*     steady-state expected average delay, i.e. lim(n-->oo) (D1+D2+...+Dn)/n]

*    "Kelton and Law ["The Transient Behavior of the M/M/s Queue with
*     Implications for Steady-State Simulation", TR 82-7, Dept. of MIS,
*     U. of Arizona, 1982] computed E(Di) as a function of i and the number
*     in system at time 0, say, l [lower-case l, not 1] for various M/M/s
*     queues....
*     Let i(l,0.01) [be] the smallest point [value of i]
*     beyond which all values of E(Di) fall withing 1 percent of d,
*     viewed as a function of the initial number in the system, l.
*     Kelton and Law found for the M/M/1 queue with rho=0.9 that i(l,0.01)
*     [^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^]
*     is minimized for l = 15.  This result is interesting since
*     [^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^]
*     the steady-state time-average number of customers in system is 9.
*     [^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^]

Yes, 9 is fairly different from 15.  However, it doesn't make much
difference in initialization bias.  In Law's graph, the l = 10 and 
l = 15 lines (there is none for l = 9) give about the same sized
initialization bias.  In other words, just a "ballpark" value is
probably good enough to reduce the bias to acceptable scale.

>> Or, better yet, in systems which have regeneration points and don't
>> have huge (in the notation from my posting yesterday, >> T/P)
>> regeneration cycle times, one can simply start each replicate at
>> the regeneration point  --  thus no initialization bias at all.

*True, the method of regeneration theoretically does not suffer from the
*initialization bias problem.  But "theoretically" is the key word here,
*because unless what you are simulating is truly a regenerative process,
*your regeneration points are only approximate.  

Again, I think an approximate regeneration point is good enough for
practical purposes.

>> Of course, you do run into the problem you mentioned, i.e. the
>> problem of "waiting for the slowest one," which gave you your
>> P/log P figure.  However, analogous kinds of problem should plague
>> the task-partitioning approach too, e.g. as P increases, the
>> problem of idle time should get worse and worse.

*But, since the number of replications being run when intra-replication
*parallelism is used in less than P, the magnitude of the effect is smaller.

I think we're talking about different things here.  What I am talking
about occurs even if there is no replication.  What I am saying is that
due to interprocess dependencies, there will be process waiting involved,
and that this should in many applications have a log P growth rate, just
as in the replication case, even though for different reasons.

>> The task-partitioning approach also has a scaling problem.  For
>> a fixed application, say a 10-server queue, there will be a very
>> small limit on the usable value of P.  After a certain point,
>> you just can't find anything for the processors to do.

*Definitely true, which is why the best approach for a large number of
*processors is probably a combination of inter- and intra- replication
*parallelism.

This may well turn out to be the case.  But I still feel that this has
not been looked at sufficiently.  Your work sounds excellent, but you
yourself indicated that it's the only work to be done since Heidleberger's,
which itself was just a preliminary investigation.

>> In most simulations of queueing systems in practice, memory should not
>> be a problem.  The length of the event list should be on the order of 
>> magnitude of the number of servers, which is generally not large.

*Depends on what is being simulated.  I have friends at Bellcore who were
*simulating a metropolitan trunk network in which the steady-state size of
*the event list was about 15 THOUSAND events.

I heard John Comfort give a talk once in which he mentioned hearing of
some application in which the typical size of the event list was
100,000 events.  That's why I qualified my statement above by using
the word "most." :-)

   Norm