Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!decvax!decwrl!pyramid!pesnta!hplabs!ucbvax!DHDEMBL5.BITNET!OMOND
From: OMOND@DHDEMBL5.BITNET.UUCP
Newsgroups: mod.computers.vax
Subject: (VMS) Caveat midnight jobs in clusters ...
Message-ID: <8605210131.AA14432@ucbvax.Berkeley.EDU>
Date: Fri, 16-May-86 17:52:17 EDT
Article-I.D.: ucbvax.8605210131.AA14432
Posted: Fri May 16 17:52:17 1986
Date-Received: Wed, 21-May-86 06:19:35 EDT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 44
Approved: info-vax@sri-kl.arpa

A word of warning for cluster users:

We run a small cluster here consisting of an 8600 and an 11/785. The
cluster had been running uninterrupted for something like 35 days.
There appears to be a slight discrepancy between the clocks of the two
machines, such that the clock times on the two nodes slowly drift
apart. After the 35 days, the 8600 clock was about 1 minute ahead of
the 785.

One of our users has a batch job which does the following:

   1) submits itself to run "/After=Tomorrow" (i.e. at midnight the
      following day)
   2) updates a database
   3) produces a report
   4) terminates

Last night the following happened:

The time on the 8600 was 00:00, the time on the 785 23:59. The queue
manager on the 8600 released the batch job (in a generic queue) and
the job started to run on the 785. The first thing the job did was to
resubmit itself "/After=Tomorrow" which, because the time on the 785
was still 23:59:20, meant about 40 seconds later. The job continued on
its normal course, updating the database. 40 seconds later the next
job, which should have been held until the next day, was released and
started to run.

Well, you can imagine what sort of chaos ensued when the "same" job
was updating the same database with the same updates at the same
time ... :-)

Fortunately I was logged in from home and noticed fairly quickly that
something was amiss. And fortunately the chaos was fairly easy to undo.
Nevertheless, I can easily imagine a situation where costly damage
could have been done.

So the moral of the story is:

If you're running a cluster, make sure that the clock times on all
nodes are as closely synchronized as possible.

Avoid submitting jobs to run at exactly midnight. Much better (if
possible) is, let's say, Tomorrow+00:10:00, so that a clock difference
of a minute or so can no longer put the resubmission on the wrong side
of midnight.

I hope this saves someone somewhere some effort.
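
For anyone who wants to check the arithmetic, here is a minimal sketch in Python (not DCL, and not what the queue manager actually does internally). It assumes that "Tomorrow" is evaluated against the clock of the node on which the job runs, as described above. The function name after_tomorrow, the 40-second skew and the calendar dates are illustrative assumptions only.

```python
from datetime import datetime, timedelta


def after_tomorrow(local_now, offset=timedelta(0)):
    """Start of the next calendar day on the local clock, plus an offset.

    Hypothetical stand-in for "/After=Tomorrow" or "/After=Tomorrow+hh:mm:ss".
    """
    next_day = local_now.date() + timedelta(days=1)
    return datetime.combine(next_day, datetime.min.time()) + offset


# Assumed skew: the 8600 clock is 40 seconds ahead of the 785 clock.
SKEW = timedelta(seconds=40)

# Case 1: job scheduled for exactly midnight.
# The 8600 releases it at 00:00:00 by its own clock (date is arbitrary).
release_8600 = datetime(1986, 5, 16, 0, 0, 0)
now_785 = release_8600 - SKEW                 # 23:59:20 on the previous day
wait_midnight = after_tomorrow(now_785) - now_785

# Case 2: job scheduled for Tomorrow+00:10:00.
# The 8600 releases it at 00:10:00; the 785 reads 00:09:20 on the same day.
OFFSET = timedelta(minutes=10)
release_8600_b = datetime(1986, 5, 16, 0, 10, 0)
now_785_b = release_8600_b - SKEW
wait_offset = after_tomorrow(now_785_b, OFFSET) - now_785_b

print(wait_midnight)  # 0:00:40
print(wait_offset)    # 1 day, 0:00:40
```

In the first case the resubmitted job becomes eligible after only 40 seconds; in the second it waits a full day, as intended. The offset only helps as long as it is larger than the clock difference between the nodes.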