Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!decvax!decwrl!pyramid!pesnta!hplabs!ucbvax!DHDEMBL5.BITNET!OMOND
From: OMOND@DHDEMBL5.BITNET.UUCP
Newsgroups: mod.computers.vax
Subject: (VMS) Caveat midnight jobs in clusters ...
Message-ID: <8605210131.AA14432@ucbvax.Berkeley.EDU>
Date: Fri, 16-May-86 17:52:17 EDT
Article-I.D.: ucbvax.8605210131.AA14432
Posted: Fri May 16 17:52:17 1986
Date-Received: Wed, 21-May-86 06:19:35 EDT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 44
Approved: info-vax@sri-kl.arpa


A word of warning for cluster users :

We run a small cluster here consisting of an 8600 and an 11/785.
The cluster had been running uninterrupted for something like 35 days.
The clocks of the two machines appear to run at slightly different
rates, so the times on the two nodes slowly drift apart.  After the
35 days, the 8600's clock was about 1 minute ahead of the 785's.
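
To put that drift in perspective, here is a rough Python calculation
using only the two figures above (about 60 seconds after 35 days).  The
function name is made up for illustration, and it assumes the drift was
steady over the whole period, which may not be true of real clocks.

```python
# Drift figures taken from the post: about 1 minute of offset after 35 days.
SECONDS_PER_DAY = 24 * 60 * 60

def drift_stats(offset_seconds, uptime_days):
    """Return (seconds of drift per day, drift in parts per million).

    Assumes the two clocks drifted apart at a constant rate.
    """
    per_day = offset_seconds / uptime_days
    ppm = offset_seconds / (uptime_days * SECONDS_PER_DAY) * 1_000_000
    return per_day, ppm

per_day, ppm = drift_stats(offset_seconds=60, uptime_days=35)
print(f"{per_day:.2f} s/day, about {ppm:.0f} ppm")  # 1.71 s/day, about 20 ppm
```

So a difference of under two seconds a day, small enough to go
unnoticed, is all it takes to build up a minute between reboots.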

One of our users has a batch job which does the following :
1) submits itself to run "/After=Tomorrow" (i.e. at midnight the
   following day)
2) updates a database
3) produces a report
4) terminates

Last night the following happened :

The time on the 8600 was 00:00; on the 785 it was still 23:59.
The queue manager on the 8600 released the batch job (which sits in a
generic queue), and the job started to run on the 785.

The first thing the job did was to resubmit itself "/After=Tomorrow"
which, because the time on the 785 was still 23:59:20, meant in about
40 seconds.  The job continued on its normal course, updating the
database.
40 seconds later the next job, which should have been held until the
next day, was released and started to run.  Well, you can imagine what
sort of chaos ensued when the "same" job was updating the same database
with the same updates at the same time ... :-)
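
The sequence above can be sketched as a small Python model.  This is
only an illustration of the arithmetic, not of how the VMS queue
manager works internally: the function names are invented, the date is
arbitrary, and "Tomorrow" is taken to mean 00:00 on the day after
whatever the running node's clock says.

```python
from datetime import datetime, timedelta

def next_midnight(now):
    """00:00 at the start of the day after `now` (the post's "Tomorrow")."""
    return datetime.combine(now.date() + timedelta(days=1), datetime.min.time())

def resubmit_delay(release_time, skew):
    """How long the resubmitted job waits, measured on the running node.

    release_time: clock reading on the node that released the job.
    skew: how far the running node's clock lags behind the releasing node's.
    """
    running_node_now = release_time - skew
    return next_midnight(running_node_now) - running_node_now

# Arbitrary illustrative date; the releasing node's clock reads 00:00:00.
release = datetime(1986, 5, 16, 0, 0, 0)

print(resubmit_delay(release, timedelta(seconds=40)))  # 0:00:40
print(resubmit_delay(release, timedelta(0)))           # 1 day, 0:00:00
```

With the clocks in step the copy waits a full day; with the running
node 40 seconds behind, "Tomorrow" is only 40 seconds away.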

Fortunately I was logged in from home and noticed fairly quickly that
something was amiss.  And fortunately the chaos was fairly easy to
undo.  Nevertheless, I can easily imagine the situation where costly
damage could have been done.

So the moral of the story is :

        If you're running a cluster, make sure that the clocks on
        all nodes are as closely synchronized as possible.

        Avoid submitting jobs to run at exactly midnight.  Much
        better (if possible) is, let's say, Tomorrow+00:10:00, so
        that a small clock difference cannot put the resubmission
        on the wrong side of midnight.
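
Why the ten-minute margin helps can be checked with another small
Python sketch.  Again this is a simplified model under assumptions of
my own (invented function names, an arbitrary date, and the job
released exactly when the faster clock reaches the due time); it is
not a description of the real scheduler.

```python
from datetime import datetime, timedelta

def after_time(now, offset):
    """'Tomorrow + offset' as evaluated on a node whose clock reads `now`."""
    midnight = datetime.combine(now.date() + timedelta(days=1),
                                datetime.min.time())
    return midnight + offset

def resubmit_gap(skew, offset):
    """Gap between this run's due time and the next one's.

    skew: how far the running node's clock lags the releasing node's.
    offset: margin after midnight at which the job is scheduled.
    A gap of one day is what we want; a gap of zero means the copy
    is already due and runs alongside the original.
    """
    due = datetime(1986, 5, 16) + offset   # arbitrary illustrative date
    running_node_now = due - skew          # what the running node believes
    return after_time(running_node_now, offset) - due

ten_min = timedelta(minutes=10)
print(resubmit_gap(timedelta(seconds=40), timedelta(0)))  # 0:00:00
print(resubmit_gap(timedelta(seconds=40), ten_min))       # 1 day, 0:00:00
print(resubmit_gap(timedelta(minutes=11), ten_min))       # 0:00:00
```

In this model the job is safe as long as the clock difference does not
exceed the margin, so the margin should be chosen comfortably larger
than any drift you expect between the nodes.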

I hope this saves someone somewhere some effort.