Path: utzoo!attcan!uunet!wuarchive!udel!ee.udel.edu From: new@ee.udel.edu (Darren New) Newsgroups: comp.os.misc Subject: Re: What constitutes a good OS? Message-ID: <42683@nigel.ee.udel.edu> Date: 24 Jan 91 18:39:49 GMT References: <5427@auspex.auspex.com> <42488@nigel.ee.udel.edu> <5461@auspex.auspex.com> Sender: usenet@ee.udel.edu Organization: University of Delaware Lines: 254 Nntp-Posting-Host: estelle.ee.udel.edu (Sorry if you see this twice. Our newsserver crashed in the middle of submitting it and I didn't see it show up at our site.) In article <5461@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes: >Many systems, including UNIX-flavored ones such as UNIX and Plan 9, >provide, as a "platform", a "file system" that provides access to named >randomly-accessible collections of bytes, and treat text files, keyed >files of various sorts, and data structures of sorts often *not* >provided by various OS's "access methods" as applications atop that file >system. It doesn't bother me that keyed files in those systems aren't >at the lowest level of the file system; I see little if any win to that. What I think you are saying here is "why put a complex filesystem in the kernel when all the functionality can be provided in a library?" I would like to address this with an example. Lets look at a filesystem that is only marginally more complex than UNIX's. A file, when created, can have a flag set during creation that says "This is the new kind of file". (Necessary to allow room for the informative overhead.) This new file type is exactly like the old one except that two new calls are added: insert(fd, buf, size) inserts those bytes at the current file cursor, delete(fd, size) deletes that many bytes starting at the current file cursor. For simplicity, lets exclude lseek() on these files for now. (Note: in the following discussion, when I say "library" I mean user-mode application library, dynamic or not. When I say "kernel" or "filesystem" I mean a filesystem that has access to presumedly privledged information in the kernel such as device buffers, disk allocation bitmaps, etc and also has the ability to prevent or otherwise control process preemption and/or resumption. "kernel" does not necessarily mean resident, nor does library mean statically linked.) My proposed organization is to make each file a linked list of blocks. Each block has an additional three words of (non user-data) information in it: the previous block, the next block, and the number of bytes of used user data in this block. Clearly, in the simple cases, this is about as easy to implement in a user library as it is in a kernel. However, a kernel implementation has quite a few advantages: 1) When the need to insert a block arises, a block near (in an access-time sense) its predicessor or successor can be chosen. In a user library, the control of which block will be chosen is not available; even if it was, the interface would need to be complicated because the block could be used between the time it was found and the time it was inserted; hence, some sort of atomic call would be needed. Also, new blocks being inserted would probably be tacked onto the end of the file, where normal UNIX semantics might say "allocate a block `close' to the current last block of the file." 2) When a block is deleted from the middle of the file, the kernel version can return that block to the free space pool immediately. The user library will need to either write all zeros to it and hope that the kernel will notice this, or it will have to maintain a separate list of "free blocks". In either case, the fact that the block is free either has to be recorded somewhere in the file (hence taking up more room for the free list and not really being available for other files anyway) or the library must scan the entire file looking for a block that when read looks like all zeros. It's not even possible to store the free block list inside the free blocks because then they won't all be zeros. 3) The forward and backward links in the blocks as stored by the kernel can be direct diskblock numbers. In a user-library, they must be offsets from the beginning of the file. Hence, with the kernel, there need be no "extention blocks" in the inode and reading an entire file can be done as fast as it takes to physically read the disk. In the user-library implementation, blocks are not necessarily read in the order in which they appear in the file; hence, extension-block caching may not be efficient. Once the files get above a certain (admitedly large) size, you run out of extension blocks, which are not even needed in new-files. 4) The number of kernel/user context switches is much greater when the library is in user mode (requiring several separate calls to each of lseek, read, and write just to insert a block, with probably only a dozen instructions between each pair of calls). This is very costly on machines with large register sets or memory maps or where the cache is flushed on such switches. 5) semi-strawman: robustness is increased because it's generally harder for a user to screw up the kernel than it is for one to screw up a user library. I'm aware of the counter-arguments to this one. 6) semi-strawman #2: The memory and call-overhead penaltys are probably lower if the new file type is used often, as the dynamic library stubs don't have to be linked in, the dynamic library cache does not need to be stored, etc. That is, if it's always going to be around anyway, it's probably better to have it in the kernel than to go through the extra overhead of making it a usually-resident dynamic library that then accesses the kernel. For a slightly more complex specification, let's add multi-user shared access. An assumed requirement is that all calls are atomic. This is easy in the kernel, and difficult in the user library. It is easy in the kernel because locks can be modified without releasing them when information is inserted or deleted before them. (Here, I'm assuming that once you lock some bytes, the contents of those bytes cannot change and an insertion or deletion before those bytes will not "slide" them out from under the lock.) It is difficult in the user library not only because one process cannot modify another's locks, but also *in UNIX* locks lock ranges of bytes in the file. When the user requests to lock bytes 2000 through 2050, the library has no idea which bytes of the file it needs to lock; it must find the correct bytes to lock, which implies reading possibly locked structural information from the file. (Note that under some OSes, e.g. HP MCP and CP-V, locks are more symbolic and are disassociated from the files they are locking. This would make matters easier but not easy.) 1) one possibility is to have each user lock the file, make all changes, and then unlock the file, caching no information between calls. Basically, you have no multi-user shared access. The only overhead saved is the cost of openning and closing the file each time. Otherwise, all information you need has to be copied into and out of the kernel buffers on every call. This is why we invented buffered I/O and setbuf. 2) If you lock only the blocks you are working on, you cannot even cache that block between calls, as somebody may have written to it in the meantime. It also prevents scanning the file to find all-zero blocks. Without a mechanism to asynchronously notify you that a file block you are locking is being requested elsewhere, you cannot afford to hold locks for an arbitrary length of time. 3) Any sort of free-space list you may use will need to be locked on both reads and writes (so a writer does not delete a block that you then continue to read), again causing excessive contention. I would no more want to add this type of file in user-mode than I would want to implement directories in user-mode, if the directories allowed some processes to be in readdir() while others were in creat() or unlink(). I've not seen a single-tasking OS that turned into a multi-tasking OS where such conditions were not buggy for at least two OS releases. (It took AmigaDOS 4 releases to get this right, and the latest tests aren't even in yet.) Look at what the kernel can do, due to the fact that all users go through the *same* kernel to get to the file: 1) Arbitrary portions of the file can be cached in memory. 2) Multiple atomic operations can be carried out concurrently due to the fact that the kernel *knows* which areas of the file are under modification right now. 3) Read-ahead and write-behind are possible because the memory is being shared between processes. 4) Since chances are good that upon reading a mostly-empty block, the previous block will still be in memory, it may be possible to compress two half- empty blocks into a single block with no additional overhead (due to the write-behind). Before you say all this stuff is just for "efficiency" or "programmer's convenience", remember that interrupts, DMA, disk caching, demand-paging, and memory protection are all just "efficiency" and "programmer's convenience", as is the rest of the operating system. Let's add lseek()s to this new file type. First, how would I do it in the kernel? I could maintain an in-core table, created the first time a new-file is lseeked to somewhere other than the first or last byte. The table could give me the offset-in-bytes, the physical block number and the byte-within-block. If I only stored when seeking *from* the middle of the file, seeking back to anyplace I'd taken an ftell() from would be instantaneous; seeking elsewhere (say, ftell()+200) would require some amount of reading, but since it is in the kernel, buffers need not be copied into user space while skipping. Since this cache is in shared memory, somebody else deleting or inserting bytes could update the table offset-in-bytes entries. The cache could grow based on how much kernel memory is available, and shrink when *any* process anywhere needs that memory. The table or something like it could be stored along with the file (in a separate block, say) while the file is closed, for efficiency. When disk space is low, these extra blocks (which are only there for efficiency, after all) could be reaped out of the files on demand, possibly even based on LRU times. Most disks around here are >97% full most of the time, since most people only delete files when they run out of room; I won't accept the "so get a bigger disk" argument :-). Also, the number of blocks allocated to the file could be tracked, and when the density fell below a certain level, the file could automatically be compressed upon the last close, which the kernel is tracking anyway; you can do this in the user library, but then you need all sorts of extra code to keep track of file opens and closes and when the file is being compressed in addition to the compression code. Also, as above, freeing blocks as they are compressed is problematic. A user-mode deamon could copy the file to a new, compressed file and then delete the old version (as long as nobody else has it open), but then you need room for two copies of the file that you are compressing and the deamon has to be triggered and *find* the right files to compress; the deamon also has to be a trusted program. How could something like this be done in the user library? Well, we could just read the whole file every time we did a seek. Except that now we again run into any locks that anybody else has and we have to lock all that part of the file while we're reading. We also have several dozen kernel/user context switches and lots of copying between kernel memory and user memory. We could store the cache in shared memory, except that we don't have shared memory in BSD. Last time I looked System V has a single pool of non-pagable shared memory (which is in the kernel anyway), you can't find out when other processes need some and can't get it, and the name-space connection between that shared memory buffer and the file it is associated with is problematic. You also need to make up semaphores (also in the kernel memory and limited) to mediate access to the shared memory, also causing context switching and name resolution complexities. In addition, if the dynamic libraries are usually open anyway, all that code is in memory as well, so you are not saving memory by moving it to the user library. (Besides, you could page or overlay the parts of the kernel dealing with file structures, so if nobody is using the complex structures, you wouldn't waste resident real memory.) Another choice may be to implement the whole thing with caches in the shared memory, using mmap() or something similar. Note that if you did things this way, you would essentially be rewriting huge portions of the UNIX filesystem just to add this one tiny feature. You would be reimplementing disk-block allocation and deallocation, block buffering, contention resolution, file descriptor consistancy maintainance, and so on; you are essentially treating a UNIX file as a block-structured device and writing a new filesystem as a user library. Anyway, I think that makes some good points, and I would be interested in any discussion this may raise. I would be especially interested in hearing why putting such functionality is actually *better* than putting it in the kernel, rather than simply possible, which is all I've heard up to now. Another point: given that we have insert() and delete() added to the kernel, it becomes much simpler to add more complex structures as user-mode libraries if desired. For example, byte-counted records are simple. The file format is a bytecount, that many bytes of data, and another record. Lengthening and shortening records is no longer traumatic. KSAM files become much simpler, because the block split operation during addition of a new key is just the insert() operator with the right parameters. Libraries like dbm no longer have to return keys in an "arbitrary" order because they can put the keys in the right place to start with. Things like the Macintosh "resources" can be added to files efficiently. (BTW, Mac resources or keyed files are much like lightweight directories or lightweight dynamically-linked libraries. You can do it with the heavyweight versions, but it's much less efficient and uglier.) Insertion of key blocks into keyed files can insert them in an efficient-to-retreive location. Editors no longer have to suck up the entire file and blow it back out to add one character to the first line. If directory structures were based on new-file structures as opposed to the current UNIX-file structures, they could be maintained in sorted order, would automatically shrink when many files were deleted, and would have no wasted space in them. Hence, I claim that the UNIX filesystem is only marginally too weak to support complex file structures conveniently. -- --- Darren New --- Grad Student --- CIS --- Univ. of Delaware --- ----- Network Protocols, Graphics, Programming Languages, Formal Description Techniques (esp. Estelle), Coffee, Amigas ----- =+=+=+ Let GROPE be an N-tuple where ... +=+=+=