Path: utzoo!utgpu!cunews!bnrgate!brtph3!brchh104!brchs1!bnr.ca!rice.edu!sun-spots-request
From: rkc@xn.ll.mit.edu
Newsgroups: comp.sys.sun
Subject: lockf
Keywords: Miscellaneous
Message-ID: <2237@brchh104.bnr.ca>
Date: 27 Mar 91 21:35:00 GMT
Sender: news@brchh104.bnr.ca
Organization: Sun-Spots
Lines: 55
Approved: Sun-Spots@rice.edu
X-Original-Date: Tue, 26 Mar 91 11:23:24 EST
X-Sun-Spots-Digest: Volume 10, Issue 69, message 27
X-Note: Submissions: sun-spots@rice.edu, Admin: sun-spots-request@rice.edu

I have written an application, similar to a network database application,
in which data is stored in an NFS-accessible file. To protect against
multiple simultaneous updates, I use the lockf subroutine to lock the
entire file. I have had numerous problems with the client lockd daemon
getting confused and forgetting to unlock a lock that is in place, while
the server thinks the unlock has succeeded. Specifically, a ps -aux on
machine B produces the following:

  USER       PID %CPU %MEM   SZ  RSS TT STAT START  TIME COMMAND
  username  4484  4.7  0.0   64    0 ?  D    17:45  0:39 lock_program

while machine C can successfully acquire the lock, go about its business,
and release the lock. (Machine A is the host for the filesystem where the
file resides; the filesystem is hard-mounted on the client systems.) A
final clue to my problems is that lock_program is sometimes run by the at
program.

My questions are three:

1. Once machine B is confused, how do I unconfuse it? Specifically, PID
   4484 cannot be killed, as it is waiting in a non-interruptible state
   for a resource. Killing and restarting both lockd and statd on the host
   and client machines neither lets the above process continue nor allows
   others to access the lock. The only fix appears to be rebooting the
   machine. (Other users don't like this too much!)

2. Am I doing something inherently wrong?
   To keep a process from being killed while it owns the lock, I catch
   the following signals:

     signal( SIGHUP,  clnp );
     signal( SIGQUIT, clnp );
     signal( SIGINT,  clnp );
     signal( SIGILL,  clnp );
     signal( SIGIOT,  clnp );
     signal( SIGEMT,  clnp );
     signal( SIGFPE,  clnp );
     signal( SIGBUS,  clnp );
     signal( SIGSEGV, clnp );
     signal( SIGSYS,  clnp );
     signal( SIGTERM, clnp );

   Should I catch more? Here's what the lock code looks like:

     /* Poll for the lock without blocking; give up after NUMPOLLS retries. */
     for (NumAttempts = 0; NumAttempts <= NUMPOLLS; NumAttempts++) {
         if (lockf(fd, F_TLOCK, 0L) != (-1)) {
             success = TRUE;
             break;
         }
         sleep(2);
     }

   I avoid the indefinitely blocking lock (F_LOCK) because it appears to
   increase the probability that the error will occur.

3. Would creating a lock file via open be a workable network solution?
   Are there other workarounds (semaphores, etc.) that I should try? I
   would prefer to get this to work properly using lockf, since this
   seems to be exactly what lockf is designed for.

Our network consists of SPARCstation 1+'s running either 4.0.1 or 4.1, and
Sun-3's running 4.0. In the near future we will also be using DG's
AViiON/UX workstations.

Thanks for any help you can provide,

-Rob
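P.S. To make question 3 concrete, here is a minimal sketch of what I mean by a lock file created via open. The names acquire_lockfile and release_lockfile and their parameters are hypothetical and not taken from lock_program. The sketch assumes that an exclusive create (O_CREAT|O_EXCL) is atomic on the filesystem in use; I have not verified that this holds across NFS clients, which is the doubt behind the question.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to create `path` exclusively, retrying up to
 * `max_polls` more times with `delay_sec` seconds between attempts (the
 * same polling shape as the lockf loop above).
 * Returns 0 if this process created the file (lock held), -1 otherwise.
 * ASSUMPTION: O_CREAT|O_EXCL is atomic on the filesystem holding `path`.
 */
int acquire_lockfile(const char *path, int max_polls, unsigned delay_sec)
{
    int attempt;

    for (attempt = 0; attempt <= max_polls; attempt++) {
        /* Fails with EEXIST if another process already holds the lock. */
        int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd != -1) {
            close(fd);
            return 0;
        }
        if (errno != EEXIST)
            return -1;              /* a real error, not contention */
        if (attempt < max_polls)
            sleep(delay_sec);
    }
    return -1;                      /* still locked after all attempts */
}

/* Hypothetical helper: release the lock by removing the lock file. */
int release_lockfile(const char *path)
{
    return unlink(path);
}
```

The caller would wrap each update in acquire_lockfile(...) and release_lockfile(...). Unlike lockf, nothing removes the file if the holder dies, so a stale lock file would need to be cleaned up by hand or by the signal handler.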