Path: utzoo!attcan!uunet!husc6!mailrus!cornell!uw-beaver!microsoft!w-colinp From: w-colinp@microsoft.UUCP (Colin Plumb) Newsgroups: comp.sys.amiga.tech Subject: Faster File Systems Keywords: BOOH Message-ID: <55@microsoft.UUCP> Date: 12 Dec 88 01:12:32 GMT Reply-To: w-colinp@microsoft.UUCP (Colin Plumb) Organization: Microsoft Corp., Redmond WA Lines: 152 Confusion: Microsoft Corp., Redmond WA Having gone through one file system design (Cogent XTM, if you've heard of it; it's appeared in Byte a bit) and seen what great speed improvements can be made by avoiding the Unix-style block allocation, I sat down to design a file system specifically for Amiga floppies. After all, they should be extremely fast - no waiting for the right sector to come around, just 5.5*5 revs/sec = 27.5 K/sec. But somehow they never quite make it. This is the result of about a day and a half of thinking. I throw it out here for comment. I particularly need to think more about making ExNext faster. (I think it's a pretty broken way to do directories; the "where you are in a directory" state should be distinct from the lock and file info. Even if this was done, there are performance hits in not allowing a person to get less information about each file in a directory.) Anyway, the Bat Out Of Hell file system: Observation: on the Amiga, there are 2 costs to reading a sector: - seek time, 3ms.track or thereabouts. 240ms (1/4 sec) worst-case. There's also head-settling time (15 ms?), but I forget how much. - read time, 60 sec/min / 300rpm * 2 = 0.4 sec/track. 400 ms is significantly more expensive than even a worst-case seek. Thus, while it is a *very* good idea to minimise the number of tracks a file is distributed among, it doesn't matter that much which ones. Note that this suggests that if trackdisk.device only read the surface it had to, things would go significantly faster. Once you have a track, doing just about anything with its contents is essentially free. After all, decoding it involves a blitter pass. A bunch of data shuffling can't cost more. This whole thing suggests an extent-based file system, where hopefully extents are whole tracks. The problem with extent-based file systems is fragmentation. A sector-based scheme lets you use any free space for any allocation request, while an extent-based one may not have an extent of the proper size. But, we have observed that accessing any specific track takes approximately 1/2 second, wherever it is, and doing anything with a track is free. Thus, a scheme which uses extents on a track basis (and compacts garbage), but allocates tracks one at a time. Thus, the scheme proper: A floppy consists of 80 tracks. A track is 11K of data, and 352 bytes of header-area data. (Also known as "OS recovery info".) All free space on a track is compacted, so the free map is simply a count of free space on a track (16 bits) times 80 tracks - less than the current 220 bytes of bitmap! A file inode contains pointers to extents, each containing a track number and the amount of data stored there. I'll call it an inode, even though it contains file name, comment, etc, just like a regular AmigaDos file header block. It's also variable-sized, growing or shrinking when a comment is changed, when renamed, or when an extent is added. Note that any integral number of bytes can be stored in an extent - great space efficiency, especially for small (e.g. ENV: entries) files! Each track, in the header area (and overflowing onto the track proper if necessary), keeps an index of all the extents stored on that track. Given an inode number and extent number, it can give you the offset into the track at which the extent begins. This way, you can compact the contents of a track without having to fix up the inodes which point to it. The track index also has an index of the inodes stored on the track, but mostly they're treated the same as part of a file's data. Note that, assuming inodes are always larger than 13.75 bytes, you can't fit more than 64K of them on a disk, so you can use 16-bit inode numbers (useful to keep the track indeces small). Now, there are two kinds of extents the file system worries about: fixed-size ones (other than the last extent in a file - they're never going to change size) and variable-sized ones (the last extent in a file, and inodes, which can grow). For the former, you want to pack them in to tracks as tight as possible, while you want to leave some slack on any track that has the latter. So the algorithm for adding data to a file works like this: - If there is no room in the file's last extent (get last extent's track number, look at that part of the free map), then allocate an extent of the apropriate size somewhere. You may want to pick the track to minimise seek time, but, as I said, except in extreme cases, this isn't crucial. - If there is enough room in the file's last track to hold the data, then repack the data on the track so the "hole" is immediately after this extent (e.g. if it used to be aaaaabbcccddddeeeeee----------ff, and you write to extent c, rearrange things so it's packed like aaaaabbccc----------fddddeeeeeef, and write the data to the end of extent c. Note that this moves as much data as keeping the hole in the same place and enlarging the extent just a little bit, but eliminates the move if you subsequently write more data (likely). - If there isn't enough room on the track, then even the obvious "top off this track and allocate a new extent" will require writes to two tracks, but there are better ways to use them: -- If possible, reallocate the whole extent on a different track. This saves a track read in future, and avoids growing an inode. Note that, if there is a spare track on the disk, this algorithm will put 11K of a slowly-growing file in it. -- If not, try to find the largest available extent on a fixed-size track (that is, one with only fixed-size extents on it), copy data into it starting with the beginning of what's already on disk, and recurse for what remains to be stored somewhere. Hopefully, it will fit on the current track. Together with the last rule, this will make sure free tracks are used up by files that can use them. -- If all else fails, fill up a track with variable-sized extents on it. The only other tricky bit lies in enlarging inodes, when an extent is added to them or when they're renamed, since you can't split them. you have to move them, and fix up the pointer in the parent directory. However, note that it's safe to delete the inode from the current track and write it somewhere else, and *then* fix up the parent, since even if the machine crashed before the last stage of this operation, the parent's pointer to the file can be proven bad (the inode isn't in the index). Nasty point: file deletion or truncation is somewhat expensive, as we can't just mark the blocks as free, we have to go to each track and add the block to the hole. Fortunately, we can take a short-cut with tracks on which the deleted file is the only thing there and simply mark them as empty. Since we hope for a good number of these, it should reduce the work considerably. Well? For those who have made it to the end of this, I believe this will provide fast access and near-optimal disk usage. But I'll have to write it to find out. Any suggestions before I do that? I do, as I said, need to think about making ExNext faster. The fact that inodes are *much* smaller than they are under OFS should help. Question: If I write a full track, does the trackdisk.device bother to read it first? After all, my FS knows when there's nothing on the track worth keeping. (Oh, damn... I just realised that, if I can migrate inodes across track boundaries, I'll have to fix up all the references on each extent. I can get around this by assigning each inode an address and a unique i.d. number (16 bits would be enough if I could track them, but I should use 32 if I want blind "i.d.'s will never get reused" to work) which the per-track extent indices would use, but it's ugly and takes up more memory. Any ideas? One possibility is to try and migrate a data extent, or failing that, the inode with the fewest extents to patch up, but that's pretty baroque.) For the advanced: figure out how to leave some faint hope for file undeletion with this scheme. I have some ideas, but they're not pretty. Note that for recovering from a trashed file system things are great, as the track index lets you identify all the bits of a file and put them in order. More questions: should I bother allocating on 1-byte boundaries? 16-byte boundaries will work almost as well, and let me pack track number and # of bytes into 16 bits. -- -Colin (uunet!microsof!w-colinp)