Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!zaphod.mps.ohio-state.edu!ncar!mephisto!mcnc!rti!trt
From: trt@rti.rti.org (Thomas Truscott)
Newsgroups: comp.unix.wizards
Subject: Re: Is HDB locking safe?
Summary: Yes, HDB locking is safe!
Message-ID: <4024@rtifs1.UUCP>
Date: 15 Aug 90 19:46:12 GMT
References: <577@oglvee.UUCP>
Organization: Research Triangle Institute, RTP, NC
Lines: 62

> ... HDB assumes that if the pid recorded
> in the lock file no longer corresponds to an active process, the lock file is
> defunct and can safely be removed.  I can't for the life of me figure out a
> safe way of doing this.

A crucial detail in recovering from a breakdown in the lock protocol
is avoiding a race between two or more processes that are simultaneously
attempting recovery.  Usually a strategic pause is all that is needed,
and as you can see in the HDB code below there is just such a pause.

> static int
> checklock(lockfile)
> char *lockfile;
> {
> 	...
> 	if ((lfd = open(lockfile, 0)) < 0)
> 		return(0);
> 	...
> 	if ((kill(lckpid, 0) == -1) && (errno == ESRCH)) {
> 		/*
> 		 * If the kill was unsuccessful due to an ESRCH error,
> 		 * that means the process is no longer active and the
> 		 * lock file can be safely removed.
> 		 */
> 		unlink(lockfile);
> 		sleep(5);		/* avoid a possible race */
> 		return(1);
> 	}
> 
> In this code there is no guarantee that lfd and lockfile correspond to the
> same file at the time of the unlink.

But there *is* a guarantee -- the "sleep(5);"!!
[In the quoted code I changed the sleep() line to match the one
in 4.3 BSD uucp "ulockf.c"]

Consider a process "X" that discovers that the locking
process has terminated.  X unlinks the lockfile,
but then it *pauses* before it attempts to grab the lock for itself
(done by code not shown above).

Now consider scenario #1 for another process "Y":
At nearly the same instant Y discovers
the dead lock, so it also unlinks the lockfile
(of course only one unlink can succeed) and it *also pauses*.
By the time X and Y resume, the dead lock is gone,
so each attempts to grab the lock in the usual way (code not shown above),
and the usual protocol lets only one of them succeed.

Now consider scenario #2 for Y:
Just after X has unlinked the lockfile, Y calls checklock()
and discovers no lock is present.  No problem, it just
attempts to grab the lock in the usual way (code not shown above).
When X awakes from its slumber it will discover that Y has
already grabbed the lock, so X will just have to wait.
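
To make the two scenarios concrete, here is a minimal sketch of recovery
followed by the "usual" grab.  It is not the HDB source: the names
clear_stale_lock(), grab_lock() and RECOVERY_PAUSE are made up for
illustration, the pid is assumed to be stored in the lock file as ASCII
text, and the grab uses open() with O_EXCL, which real uucp code may well
do differently.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define RECOVERY_PAUSE 5	/* seconds; hypothetical name, value from the quoted code */

/*
 * Return 0 if no live process holds the lock (removing a dead lock
 * if one is found), -1 if the lock must be treated as held.
 */
static int
clear_stale_lock(const char *lockfile)
{
	char buf[32];
	ssize_t n;
	long pid;
	int fd;

	if ((fd = open(lockfile, O_RDONLY)) < 0)
		return 0;		/* no lock file, nothing to recover */
	n = read(fd, buf, sizeof buf - 1);
	close(fd);
	if (n <= 0)
		return -1;		/* empty or unreadable: assume held */
	buf[n] = '\0';
	pid = strtol(buf, NULL, 10);	/* assumes an ASCII pid */
	if (pid <= 0)
		return -1;
	if (kill((pid_t)pid, 0) == -1 && errno == ESRCH) {
		unlink(lockfile);	/* may fail if another process got there first */
		sleep(RECOVERY_PAUSE);	/* let any other recovering process finish its unlink */
		return 0;
	}
	return -1;			/* owner is alive (or kill failed with EPERM) */
}

/* Try to take the lock: 0 on success, -1 if someone else holds it. */
static int
grab_lock(const char *lockfile)
{
	char buf[32];
	int fd, len;

	if (clear_stale_lock(lockfile) != 0)
		return -1;
	/* O_EXCL makes creation atomic, so only one contender can win. */
	if ((fd = open(lockfile, O_WRONLY | O_CREAT | O_EXCL, 0444)) < 0)
		return -1;
	len = snprintf(buf, sizeof buf, "%ld\n", (long)getpid());
	if (write(fd, buf, (size_t)len) != (ssize_t)len) {
		close(fd);
		unlink(lockfile);
		return -1;
	}
	close(fd);
	return 0;
}
```

In scenario #1 both X and Y reach the sleep() and then race in grab_lock(),
where the exclusive create picks one winner.  In scenario #2 Y never sees
a lock file, so it goes straight to the exclusive create while X is still
asleep.  Note that this sketch treats an empty lock file as held, since
the file exists for a moment before the pid is written into it.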

The HDB code is nice, but does have flaws:
(a) The original "sleep(1);" is too short to avoid a race on a very busy system.
(b) Lock recovery is obscure, so the sleep() call should be commented.
(c) Protocol breakdown is a bad thing, and should be reported:
	logent(lockfile, "DEAD LOCK");
The 4.3 BSD ulockf.c routine gets all three of these right.

	Tom Truscott