Path: utzoo!attcan!uunet!wuarchive!udel!ee.udel.edu
From: new@ee.udel.edu (Darren New)
Newsgroups: comp.os.misc
Subject: Re: What constitutes a good OS?
Message-ID: <42683@nigel.ee.udel.edu>
Date: 24 Jan 91 18:39:49 GMT
References: <5427@auspex.auspex.com> <42488@nigel.ee.udel.edu> <5461@auspex.auspex.com>
Sender: usenet@ee.udel.edu
Organization: University of Delaware
Lines: 254
Nntp-Posting-Host: estelle.ee.udel.edu

(Sorry if you see this twice. Our newsserver crashed in the middle
of submitting it and I didn't see it show up at our site.)

In article <5461@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>Many systems, including UNIX-flavored ones such as UNIX and Plan 9,
>provide, as a "platform", a "file system" that provides access to named
>randomly-accessible collections of bytes, and treat text files, keyed
>files of various sorts, and data structures of sorts often *not*
>provided by various OS's "access methods" as applications atop that file
>system.  It doesn't bother me that keyed files in those systems aren't
>at the lowest level of the file system; I see little if any win to that.

What I think you are saying here is "why put a complex filesystem in
the kernel when all the functionality can be provided in a library?" I
would like to address this with an example.  Lets look at a filesystem
that is only marginally more complex than UNIX's. A file, when created,
can have a flag set during creation that says "This is the new kind of
file".  (Necessary to allow room for the informative overhead.) This
new file type is exactly like the old one except that two new calls are
added:  insert(fd, buf, size) inserts those bytes at the current file
cursor, delete(fd, size) deletes that many bytes starting at the
current file cursor.  For simplicity, lets exclude lseek() on these
files for now.

(Note: in the following discussion, when I say "library" I mean
user-mode application library, dynamic or not. When I say "kernel" or
"filesystem" I mean a filesystem that has access to presumedly
privledged information in the kernel such as device buffers, disk
allocation bitmaps, etc and also has the ability to prevent or
otherwise control process preemption and/or resumption. "kernel" does
not necessarily mean resident, nor does library mean statically
linked.)

My proposed organization is to make each file a linked list of blocks.
Each block has an additional three words of (non user-data) information
in it: the previous block, the next block, and the number of bytes of
used user data in this block.  Clearly, in the simple cases, this is
about as easy to implement in a user library as it is in a kernel.
However, a kernel implementation has quite a few advantages:

1) When the need to insert a block arises, a block near (in an
access-time sense) its predicessor or successor can be chosen. In a
user library, the control of which block will be chosen is not
available; even if it was, the interface would need to be complicated
because the block could be used between the time it was found and the
time it was inserted; hence, some sort of atomic call would be needed.
Also, new blocks being inserted would probably be tacked onto the end
of the file, where normal UNIX semantics might say "allocate a block
`close' to the current last block of the file."

2) When a block is deleted from the middle of the file, the kernel
version can return that block to the free space pool immediately. The
user library will need to either write all zeros to it and hope that
the kernel will notice this, or it will have to maintain a separate
list of "free blocks".  In either case, the fact that the block is free
either has to be recorded somewhere in the file (hence taking up more
room for the free list and not really being available for other files
anyway) or the library must scan the entire file looking for a block
that when read looks like all zeros.  It's not even possible to store
the free block list inside the free blocks because then they won't all
be zeros.

3) The forward and backward links in the blocks as stored by the kernel
can be direct diskblock numbers. In a user-library, they must be
offsets from the beginning of the file. Hence, with the kernel, there
need be no "extention blocks" in the inode and reading an entire file
can be done as fast as it takes to physically read the disk. In the
user-library implementation, blocks are not necessarily read in the
order in which they appear in the file; hence, extension-block caching
may not be efficient. Once the files get above a certain (admitedly
large) size, you run out of extension blocks, which are not even needed
in new-files.

4) The number of kernel/user context switches is much greater when the
library is in user mode (requiring several separate calls to each of
lseek, read, and write just to insert a block, with probably only a
dozen instructions between each pair of calls). This is very costly on
machines with large register sets or memory maps or where the cache is
flushed on such switches.

5) semi-strawman: robustness is increased because it's generally harder
for a user to screw up the kernel than it is for one to screw up a user
library. I'm aware of the counter-arguments to this one.

6) semi-strawman #2: The memory and call-overhead penaltys are probably
lower if the new file type is used often, as the dynamic library stubs
don't have to be linked in, the dynamic library cache does not need to
be stored, etc.  That is, if it's always going to be around anyway,
it's probably better to have it in the kernel than to go through the
extra overhead of making it a usually-resident dynamic library that
then accesses the kernel.

For a slightly more complex specification, let's add multi-user shared
access.  An assumed requirement is that all calls are atomic.  This is
easy in the kernel, and difficult in the user library. It is easy
in the kernel because locks can be modified without releasing them
when information is inserted or deleted before them. (Here, I'm
assuming that once you lock some bytes, the contents of those
bytes cannot change and an insertion or deletion before those
bytes will not "slide" them out from under the lock.) It is
difficult in the user library not only because one process
cannot modify another's locks, but also *in UNIX* locks lock
ranges of bytes in the file. When the user requests to
lock bytes 2000 through 2050, the library has no idea which
bytes of the file it needs to lock; it must find the correct
bytes to lock, which implies reading possibly locked structural
information from the file. (Note that under some OSes, e.g.
HP MCP and CP-V, locks are more symbolic and are disassociated
from the files they are locking. This would make matters
easier but not easy.)

1) one possibility is to have each user lock the file, make all
changes, and then unlock the file, caching no information between
calls. Basically, you have no multi-user shared access.  The only
overhead saved is the cost of openning and closing the file each time.
Otherwise, all information you need has to be copied into and out of
the kernel buffers on every call. This is why we invented buffered I/O
and setbuf.

2) If you lock only the blocks you are working on, you cannot even
cache that block between calls, as somebody may have written to it in
the meantime. It also prevents scanning the file to find all-zero
blocks.  Without a mechanism to asynchronously notify you that a file
block you are locking is being requested elsewhere, you cannot afford
to hold locks for an arbitrary length of time.

3) Any sort of free-space list you may use will need to be locked on
both reads and writes (so a writer does not delete a block that you
then continue to read), again causing excessive contention.

I would no more want to add this type of file in user-mode than I would
want to implement directories in user-mode, if the directories allowed
some processes to be in readdir() while others were in creat() or
unlink(). I've not seen a single-tasking OS that turned into a
multi-tasking OS where such conditions were not buggy for at least two
OS releases. (It took AmigaDOS 4 releases to get this right, and the
latest tests aren't even in yet.)

Look at what the kernel can do, due to the fact that all users go
through the *same* kernel to get to the file:

1) Arbitrary portions of the file can be cached in memory. 
2) Multiple atomic operations can be carried out concurrently due to
   the fact that the kernel *knows* which areas of the file are under
   modification right now.
3) Read-ahead and write-behind are possible because the memory is
   being shared between processes.
4) Since chances are good that upon reading a mostly-empty block, the previous
   block will still be in memory, it may be possible to compress two half-
   empty blocks into a single block with no additional overhead (due to
   the write-behind).

Before you say all this stuff is just for "efficiency" or "programmer's
convenience", remember that interrupts, DMA, disk caching,
demand-paging, and memory protection are all just "efficiency" and
"programmer's convenience", as is the rest of the operating system.


Let's add lseek()s to this new file type. First, how would I do it in
the kernel?

I could maintain an in-core table, created the first time a new-file is
lseeked to somewhere other than the first or last byte. The table could
give me the offset-in-bytes, the physical block number and the
byte-within-block. If I only stored when seeking *from* the middle of
the file, seeking back to anyplace I'd taken an ftell() from would be
instantaneous; seeking elsewhere (say, ftell()+200) would require some
amount of reading, but since it is in the kernel, buffers need not be
copied into user space while skipping. Since this cache is in shared
memory, somebody else deleting or inserting bytes could update the
table offset-in-bytes entries. The cache could grow based on how much
kernel memory is available, and shrink when *any* process anywhere
needs that memory.  The table or something like it could be stored
along with the file (in a separate block, say) while the file is
closed, for efficiency. When disk space is low, these extra blocks
(which are only there for efficiency, after all) could be reaped out of
the files on demand, possibly even based on LRU times.  Most disks
around here are >97% full most of the time, since most people only
delete files when they run out of room; I won't accept the "so get a
bigger disk" argument :-).  Also, the number of blocks allocated to the
file could be tracked, and when the density fell below a certain level,
the file could automatically be compressed upon the last close, which
the kernel is tracking anyway; you can do this in the user library, but
then you need all sorts of extra code to keep track of file opens and
closes and when the file is being compressed in addition to the
compression code. Also, as above, freeing blocks as they are compressed
is problematic. A user-mode deamon could copy the file to a new,
compressed file and then delete the old version (as long as nobody else
has it open), but then you need room for two copies of the file that
you are compressing and the deamon has to be triggered and *find* the
right files to compress; the deamon also has to be a trusted program.

How could something like this be done in the user library? Well, we
could just read the whole file every time we did a seek. Except that
now we again run into any locks that anybody else has and we have to
lock all that part of the file while we're reading. We also have
several dozen kernel/user context switches and lots of copying between
kernel memory and user memory.  We could store the cache in shared
memory, except that we don't have shared memory in BSD. Last time I
looked System V has a single pool of non-pagable shared memory (which
is in the kernel anyway), you can't find out when other processes need
some and can't get it, and the name-space connection between that
shared memory buffer and the file it is associated with is
problematic.  You also need to make up semaphores (also in the kernel
memory and limited) to mediate access to the shared memory, also
causing context switching and name resolution complexities. In
addition, if the dynamic libraries are usually open anyway, all that
code is in memory as well, so you are not saving memory by moving it to
the user library.  (Besides, you could page or overlay the parts of the
kernel dealing with file structures, so if nobody is using the complex
structures, you wouldn't waste resident real memory.)

Another choice may be to implement the whole thing with caches in the
shared memory, using mmap() or something similar. Note that if you did
things this way, you would essentially be rewriting huge portions of
the UNIX filesystem just to add this one tiny feature. You would be
reimplementing disk-block allocation and deallocation, block buffering,
contention resolution, file descriptor consistancy maintainance, and so
on; you are essentially treating a UNIX file as a block-structured
device and writing a new filesystem as a user library.

Anyway, I think that makes some good points, and I would be interested
in any discussion this may raise.  I would be especially interested in
hearing why putting such functionality is actually *better* than
putting it in the kernel, rather than simply possible, which is all
I've heard up to now.

Another point: given that we have insert() and delete() added to the
kernel, it becomes much simpler to add more complex structures as
user-mode libraries if desired. For example, byte-counted records are
simple. The file format is a bytecount, that many bytes of data, and
another record. Lengthening and shortening records is no longer
traumatic. KSAM files become much simpler, because the block split
operation during addition of a new key is just the insert() operator
with the right parameters. Libraries like dbm no longer have to return
keys in an "arbitrary" order because they can put the keys in the right
place to start with. Things like the Macintosh "resources" can be added
to files efficiently. (BTW, Mac resources or keyed files are much like
lightweight directories or lightweight dynamically-linked libraries.
You can do it with the heavyweight versions, but it's much less
efficient and uglier.) Insertion of key blocks into keyed files can
insert them in an efficient-to-retreive location. Editors no longer
have to suck up the entire file and blow it back out to add one
character to the first line. If directory structures were based on
new-file structures as opposed to the current UNIX-file structures,
they could be maintained in sorted order, would automatically shrink
when many files were deleted, and would have no wasted space in them.
Hence, I claim that the UNIX filesystem is only marginally too weak to
support complex file structures conveniently.
-- 
--- Darren New --- Grad Student --- CIS --- Univ. of Delaware ---
----- Network Protocols, Graphics, Programming Languages, 
      Formal Description Techniques (esp. Estelle), Coffee, Amigas -----
              =+=+=+ Let GROPE be an N-tuple where ... +=+=+=