Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!decvax!bellcore!ulysses!allegra!princeton!caip!nike!styx!ttrdc.UUCP!levy
From: levy@ttrdc.UUCP
Newsgroups: mod.computers.vax
Subject: (none)
Message-ID: <8605300956.AA06724@ucbvax.Berkeley.EDU>
Date: Fri, 30-May-86 05:56:54 EDT
Article-I.D.: ucbvax.8605300956.AA06724
Posted: Fri May 30 05:56:54 1986
Date-Received: Sat, 31-May-86 15:44:41 EDT
Sender: daemon@styx.UUCP
Organization: The ARPA Internet
Lines: 48
Approved: info-vax@sri-kl.arpa

Subject: Re: (VMS) Caveat midnight jobs in clusters ...
In-reply-to: your article <8605210131.AA14432@ucbvax.Berkeley.EDU>

In article <8605210131.AA14432@ucbvax.Berkeley.EDU>, OMOND@DHDEMBL5.BITNET (Roy Omond) writes:
>A word of warning for cluster users :
>We run a small cluster here consisting of an 8600 and an 11/785.
>The cluster had been running uninterrupted for something like 35 days.
>There appears to be a slight discrepancy in the clocks of both machines,
>such that the clock times on both nodes wander slightly apart.  After the
>35 days, the 8600 clock time was about 1 minute faster than on the 785.
>One of our users has a batch job which does the following :
>1) submits itself to run "/After=Tomorrow" (i.e. at midnight the
>   following day)
>2) updates a database
>3) produces a report
>4) terminates
>Last night the following happened :
>The 8600 time was 00:00, the time on the 785 23:59
>The queue manager on the 8600 released the batch job (in a generic
>queue) and the job started to run on the 785.
>The first thing the job did was to resubmit itself "/After=Tomorrow"
>which, because the time was still 23:59:20, meant in about 40 seconds
>time.
>Fortunately I was logged in from home and noticed fairly quickly that
>something was amiss.  And fortunately the chaos was fairly easy to
>undo.  Nevertheless, I can easily imagine the situation where costly
>damage could have been done.
>So the moral of the story is :
>        If you're running a cluster, make sure that the clock time
>        on all nodes is as synchronized as possible.
>        Avoid submitting jobs to run at exactly midnight.  Much
>        better (if possible) is, let's say, Tomorrow+00:10:00.
>I hope this saves someone somewhere some effort.
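
To make the arithmetic above concrete, here is a rough Python sketch (not DCL) of what happens when the node with the faster clock releases the job and the node with the slower clock works out "tomorrow".  It assumes, as the article describes, that "Tomorrow" means the next midnight on the executing node's clock and that the skew is about 40 seconds.  The function names and the calendar date are made up for illustration.

```python
from datetime import datetime, timedelta

def next_run(local_now, offset=timedelta(0)):
    """Next midnight after local_now, plus an optional offset.

    Assumed model of "/After=Tomorrow" and "Tomorrow+00:10:00",
    evaluated on the clock of the node that runs the job.
    """
    midnight = datetime(local_now.year, local_now.month, local_now.day) + timedelta(days=1)
    return midnight + offset

def delay_after_release(release_time_fast, skew, offset=timedelta(0)):
    """How long until the resubmitted job is due, on the slow node's clock.

    release_time_fast: time on the fast node when the job is released.
    skew: how far the slow node's clock lags behind the fast node's.
    """
    slow_now = release_time_fast - skew
    return next_run(slow_now, offset) - slow_now

skew = timedelta(seconds=40)                 # from the article: 23:59:20 vs 00:00
midnight = datetime(1986, 5, 21, 0, 0, 0)    # arbitrary date, fast node's clock

# Job scheduled for exactly midnight: the slow node still thinks it is
# yesterday, so "tomorrow" is only 40 seconds away.
at_midnight = delay_after_release(midnight, skew)

# Job scheduled for Tomorrow+00:10:00: both clocks are already past
# midnight at release time, so the next run is a full day off.
ten_minutes = timedelta(minutes=10)
at_ten_past = delay_after_release(midnight + ten_minutes, skew, ten_minutes)

print(at_midnight)   # 0:00:40
print(at_ten_past)   # 1 day, 0:00:40
```

Under this model the ten-minute offset helps only as long as it is larger than the clock skew, which is why keeping the clocks close together still matters.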

Another question:  why couldn't the job be set up to resubmit itself AFTER
it had finished, not before?  I suppose that if the machine crashed during
the job, the resubmit step would never be reached and the job would not run
the next night, but hopefully the user concerned would learn about the crash
the next day and be able to resubmit manually, if appropriate (perhaps a
resubmit would NOT be wanted in case of a crash).  Any other reason?
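
For what it is worth, here is a small Python sketch (again not DCL) comparing the two orderings, under the same assumption that "Tomorrow" is taken as the next midnight on the executing node's clock.  The run times, the date, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

def next_midnight(now):
    """Start of the day after `now` (assumed meaning of "Tomorrow")."""
    return datetime(now.year, now.month, now.day) + timedelta(days=1)

def resubmit_time(start, run_time, resubmit_first):
    """Release time requested for the next run, on the executing node's clock.

    resubmit_first=True  : the job resubmits itself as its first step.
    resubmit_first=False : the job resubmits itself after the work is done.
    """
    when_submitted = start if resubmit_first else start + run_time
    return next_midnight(when_submitted)

# Slow node's clock when the job starts, as in the quoted article;
# the calendar date is arbitrary.
start = datetime(1986, 5, 20, 23, 59, 20)

# Resubmitting first asks for a midnight that is only 40 seconds away.
early = resubmit_time(start, timedelta(minutes=5), resubmit_first=True)

# Resubmitting last, after a hypothetical 5-minute run, lands on the
# following midnight because the local clock has passed 00:00 by then.
late = resubmit_time(start, timedelta(minutes=5), resubmit_first=False)

# A run shorter than the skew (10 seconds here) still finishes before
# local midnight, so resubmitting last does not help in that case.
late_short = resubmit_time(start, timedelta(seconds=10), resubmit_first=False)

print(early)       # 1986-05-21 00:00:00
print(late)        # 1986-05-22 00:00:00
print(late_short)  # 1986-05-21 00:00:00
```

So in this simplified model, resubmitting at the end avoids the repeat run only when the job takes longer than the clock skew.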
--
 -------------------------------    Disclaimer:  The views contained herein are
|       dan levy | yvel nad      |  my own and are not at all those of my em-
|         an engihacker @        |  ployer or the administrator of any computer
| at&t computer systems division |  upon which I may hack.
|        skokie, illinois        |
 --------------------------------   Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa,
						vax135}!ttrdc!levy