Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ll-xn!ames!oliveb!sun!gorodish!guy
From: guy%gorodish@Sun.COM (Guy Harris)
Newsgroups: comp.bugs.4bsd,comp.unix.wizards
Subject: Re: concurrent write(2) calls write bad data to file
Message-ID: <14589@sun.uucp>
Date: Fri, 6-Mar-87 07:32:06 EST
Article-I.D.: sun.14589
Posted: Fri Mar  6 07:32:06 1987
Date-Received: Sun, 8-Mar-87 05:45:51 EST
References: <692@rtech.UUCP>
Sender: news@sun.uucp
Reply-To: guy@sun.UUCP (Guy Harris)
Organization: Sun Microsystems, Mountain View
Lines: 70
Xref: mnetor comp.bugs.4bsd:206 comp.unix.wizards:1242

>This bug appears to exist only on 4.2-derived systems.

Well, I don't know about that.  You see, it's like this:

Process A does a "write" call.  It grabs the current value of the
file pointer and uses it as the write offset.  It then locks the
inode and goes in to write stuff.  The write requires a new block to
be allocated.  This may require I/O to be done; assume it does.  The
process blocks waiting for the I/O to complete, and process B gets
scheduled.

Since process A's "write" hasn't finished, the file pointer has NOT
been updated.  Process B now does a "write" of its own, and grabs the
same offset value that process A got.  It can't write yet, though,
because the inode is locked.  So it waits.

Process A now finishes its I/O and finishes the "write".  It unlocks
the inode and updates the file pointer by adding the number of bytes
it wrote.

Now assume that process A gives up the processor as soon as it returns
from the kernel, and process B gets the processor.  It now proceeds to
write *its* data *on top of* the data that process A wrote.   It
unlocks the inode, and returns, adding the number of bytes *it* wrote
to the file pointer.  Thus, the file pointer moves by the sum of the
number of bytes processes A and B wrote.

However, only the maximum of the two byte counts was actually written
to the file.  The file pointer now points some number of bytes *past*
the last byte written; the next "write" will write at that location,
leaving behind a hole filled with - you got it - zeroes.
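
The arithmetic of that interleaving can be replayed in a toy model.
This is only a sketch in user-mode C, not kernel code: the struct and
function names ("model", "sample_offset", "finish_write",
"replay_race") are made up for illustration, and the byte counts (4 for
process A, 2 for process B) are arbitrary.

```c
#include <string.h>

#define MODEL_MAX 64

/* Toy stand-in for one shared file table entry plus the file behind it. */
struct model {
    long offset;            /* the shared file pointer */
    long size;              /* current length of the file */
    char data[MODEL_MAX];   /* file contents; untouched bytes stay zero */
};

/* First half of a write: read the file pointer, with nothing locked. */
static long sample_offset(const struct model *m)
{
    return m->offset;
}

/* Second half: copy the data to the offset sampled earlier, then add
 * the byte count to the file pointer.  No bounds checking; toy only. */
static void finish_write(struct model *m, long at, const char *buf, long n)
{
    memcpy(m->data + at, buf, (size_t)n);
    if (at + n > m->size)
        m->size = at + n;
    m->offset += n;
}

/* Replay the sequence from the text: A and B both sample offset 0,
 * A finishes, B finishes on top of A, then a third write follows. */
void replay_race(struct model *m)
{
    long a_at = sample_offset(m);       /* A sees 0, then blocks on I/O */
    long b_at = sample_offset(m);       /* B sees 0 as well */

    finish_write(m, a_at, "AAAA", 4);   /* pointer: 0 -> 4 */
    finish_write(m, b_at, "BB", 2);     /* lands at 0; pointer: 4 -> 6 */

    /* The next write starts at 6, but the file is only 4 bytes long,
     * so bytes 4 and 5 are never written and read back as zeroes. */
    finish_write(m, sample_offset(m), "CC", 2);
}
```

Starting from a zeroed "struct model", the model ends up holding
"BBAA", two zero bytes, then "CC": the pointer moved by 4 + 2 = 6 while
only max(4, 2) = 4 bytes reached the file.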

This is borne out by

	1) the fact that in a test case I ran (the test program was
	   modified so that the parent counted *down* rather than *up*,
	   so that the parent and child would be more likely to be writing
	   different numbers of bytes), it clearly looked like the two
	   processes both tried to write a record to the *same*
	   location in the file - a location that started on a
	   512-byte boundary - and that the zeroes followed this
	   scrambled record

and

	2) the fact that when I changed the program to put the file
	   descriptor in forced-append mode (so that the writes
	   *never* overlap) the problem went away.
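
For reference, one way to put an already-open descriptor into append
mode is sketched below.  It assumes a POSIX-style "fcntl" with
F_GETFL/F_SETFL and the O_APPEND flag; the helper name
"set_append_mode" is hypothetical, and this is not necessarily how the
test program above was changed.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: turn on append mode for an open descriptor.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
int set_append_mode(int fd)
{
    int flags = fcntl(fd, F_GETFL);     /* fetch current status flags */

    if (flags == -1)
        return -1;
    /* With O_APPEND set, each write goes to the current end of file,
     * whatever the file pointer held before the call. */
    return fcntl(fd, F_SETFL, flags | O_APPEND) == -1 ? -1 : 0;
}
```

Passing O_APPEND to "open" when the file is first opened has the same
effect without the extra call.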

I don't see any obvious reason why this *couldn't* happen on any
UNIX system that didn't lock the file table entry while a write was
in progress, and no system I've worked with does so.  It may be that
due to the vagaries of the scheduler, and the amount of I/O done when
extending a file in small chunks, and things like that, it's *less
likely* to happen on a system using the V7 file system, but I don't
see that it's impossible on such a system.

In short, the problem is that UNIX has never been able to guarantee
that the file pointer is always valid; it's invalid while an I/O
operation is "in progress", but nothing prevents a process from using
the file pointer's value while it isn't valid.  The solution is
something like "use file locking" or "use forced append mode" or "use
something else that will keep a process from using the file pointer
value while a 'write' is in progress," assuming you can arrange that.
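
As a sketch of the "use file locking" option: the wrapper below holds
an exclusive lock on the whole file for the duration of each "write",
so no cooperating process can pick up the file pointer while another
write is in progress.  It assumes POSIX "fcntl" record locks
(F_SETLKW), which is a choice made here for illustration and not
something taken from the discussion above; the names "lock_whole_file"
and "locked_write" are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set or clear a lock covering the whole file.
 * type is F_WRLCK to lock or F_UNLCK to unlock; blocks until granted. */
static int lock_whole_file(int fd, short type)
{
    struct flock fl = {0};

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                   /* 0 means "through end of file" */
    return fcntl(fd, F_SETLKW, &fl);
}

/* Hypothetical wrapper: do the whole write, including the kernel's
 * read and update of the file pointer, with the lock held. */
ssize_t locked_write(int fd, const void *buf, size_t n)
{
    ssize_t r;
    int saved;

    if (lock_whole_file(fd, F_WRLCK) == -1)
        return -1;
    r = write(fd, buf, n);
    saved = errno;                  /* keep write's errno across unlock */
    (void)lock_whole_file(fd, F_UNLCK);
    errno = saved;
    return r;
}
```

These locks are advisory: the scheme only helps if every process
sharing the file pointer goes through "locked_write" (or takes the same
lock) instead of calling "write" directly.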

>I think I'm also running into a variant of this problem involving
>spurious nulls being written to a pipe when a signal occurs at just
>the wrong time, and another pipe write is done in the signal handler.

Not likely in 4.2BSD, since pipes there don't go through the file
system; they go through the socket code.