Path: utzoo!attcan!uunet!husc6!mailrus!cornell!uw-beaver!microsoft!w-colinp
From: w-colinp@microsoft.UUCP (Colin Plumb)
Newsgroups: comp.sys.amiga.tech
Subject: Faster File Systems
Keywords: BOOH
Message-ID: <55@microsoft.UUCP>
Date: 12 Dec 88 01:12:32 GMT
Reply-To: w-colinp@microsoft.UUCP (Colin Plumb)
Organization: Microsoft Corp., Redmond WA
Lines: 152
Confusion: Microsoft Corp., Redmond WA

Having gone through one file system design (Cogent XTM, if you've heard of it;
it's appeared in Byte a bit) and seen what great speed improvements can be
made by avoiding the Unix-style block allocation, I sat down to design a
file system specifically for Amiga floppies.  After all, they should be
extremely fast - no waiting for the right sector to come around, just
5.5*5 revs/sec = 27.5 K/sec.  But somehow they never quite make it.

This is the result of about a day and a half of thinking.  I throw it out
here for comment.  I particularly need to think more about making ExNext
faster.  (I think it's a pretty broken way to do directories; the "where
you are in a directory" state should be distinct from the lock and file info.
Even if this was done, there are performance hits in not allowing a person
to get less information about each file in a directory.)

Anyway, the Bat Out Of Hell file system:

Observation: on the Amiga, there are 2 costs to reading a sector:
- seek time, 3ms.track or thereabouts.  240ms (1/4 sec) worst-case.
  There's also head-settling time (15 ms?), but I forget how much.
- read time, 60 sec/min / 300rpm * 2 = 0.4 sec/track.  400 ms is
  significantly more expensive than even a worst-case seek.  Thus, while
  it is a *very* good idea to minimise the number of tracks a file is
  distributed among, it doesn't matter that much which ones.

Note that this suggests that if trackdisk.device only read the surface it
had to, things would go significantly faster.

Once you have a track, doing just about anything with its contents is
essentially free.  After all, decoding it involves a blitter pass.  A bunch
of data shuffling can't cost more.

This whole thing suggests an extent-based file system, where hopefully
extents are whole tracks.  The problem with extent-based file systems is
fragmentation.  A sector-based scheme lets you use any free space for any
allocation request, while an extent-based one may not have an extent of
the proper size.

But, we have observed that accessing any specific track takes approximately
1/2 second, wherever it is, and doing anything with a track is free.  Thus,
a scheme which uses extents on a track basis (and compacts garbage), but
allocates tracks one at a time.

Thus, the scheme proper:

A floppy consists of 80 tracks.  A track is 11K of data, and 352 bytes of
header-area data.  (Also known as "OS recovery info".)

All free space on a track is compacted, so the free map is simply a count
of free space on a track (16 bits) times 80 tracks - less than the current
220 bytes of bitmap!

A file inode contains pointers to extents, each containing a track number
and the amount of data stored there.  I'll call it an inode, even though
it contains file name, comment, etc, just like a regular AmigaDos file header
block.  It's also variable-sized, growing or shrinking when a comment is
changed, when renamed, or when an extent is added.  Note that any integral
number of bytes can be stored in an extent - great space efficiency, especially
for small (e.g. ENV: entries) files!

Each track, in the header area (and overflowing onto the track proper if
necessary), keeps an index of all the extents stored on that track.
Given an inode number and extent number, it can give you the offset into
the track at which the extent begins.  This way, you can compact the contents
of a track without having to fix up the inodes which point to it.

The track index also has an index of the inodes stored on the track, but mostly
they're treated the same as part of a file's data.  Note that, assuming
inodes are always larger than 13.75 bytes, you can't fit more than 64K of
them on a disk, so you can use 16-bit inode numbers (useful to keep the track
indeces small).

Now, there are two kinds of extents the file system worries about:
fixed-size ones (other than the last extent in a file - they're never going
to change size) and variable-sized ones (the last extent in a file, and inodes,
which can grow).  For the former, you want to pack them in to tracks as tight
as possible, while you want to leave some slack on any track that has the
latter.

So the algorithm for adding data to a file works like this:

- If there is no room in the file's last extent (get last extent's track
number, look at that part of the free map), then allocate an extent of the
apropriate size somewhere.  You may want to pick the track to minimise
seek time, but, as I said, except in extreme cases, this isn't crucial.

- If there is enough room in the file's last track to hold the data, then
repack the data on the track so the "hole" is immediately after this
extent (e.g. if it used to be aaaaabbcccddddeeeeee----------ff, and
you write to extent c, rearrange things so it's packed like
aaaaabbccc----------fddddeeeeeef, and write the data to the end of
extent c.  Note that this moves as much data as keeping the hole in the
same place and enlarging the extent just a little bit, but eliminates
the move if you subsequently write more data (likely).

- If there isn't enough room on the track, then even the obvious
"top off this track and allocate a new extent" will require writes to
two tracks, but there are better ways to use them:

-- If possible, reallocate the whole extent on a different track.  This
saves a track read in future, and avoids growing an inode.  Note that, if
there is a spare track on the disk, this algorithm will put 11K of a
slowly-growing file in it.
-- If not, try to find the largest available extent on a fixed-size track
(that is, one with only fixed-size extents on it), copy data into it starting
with the beginning of what's already on disk, and recurse for what remains to
be stored somewhere.  Hopefully, it will fit on the current track.  Together
with the last rule, this will make sure free tracks are used up by files that
can use them.
-- If all else fails, fill up a track with variable-sized extents on it.

The only other tricky bit lies in enlarging inodes, when an extent is added
to them or when they're renamed, since you can't split them.  you have to
move them, and fix up the pointer in the parent directory.  However, note
that it's safe to delete the inode from the current track and write it
somewhere else, and *then* fix up the parent, since even if the machine
crashed before the last stage of this operation, the parent's pointer to the
file can be proven bad (the inode isn't in the index).

Nasty point: file deletion or truncation is somewhat expensive, as we can't
just mark the blocks as free, we have to go to each track and add the block to
the hole.  Fortunately, we can take a short-cut with tracks on which the
deleted file is the only thing there and simply mark them as empty.  Since we
hope for a good number of these, it should reduce the work considerably.

Well?  For those who have made it to the end of this, I believe this will
provide fast access and near-optimal disk usage.  But I'll have to write
it to find out.  Any suggestions before I do that?  I do, as I said, need
to think about making ExNext faster.  The fact that inodes are *much* smaller
than they are under OFS should help.

Question: If I write a full track, does the trackdisk.device bother to read it
first?  After all, my FS knows when there's nothing on the track worth keeping.

(Oh, damn... I just realised that, if I can migrate inodes across track
boundaries, I'll have to fix up all the references on each extent.  I can
get around this by assigning each inode an address and a unique i.d. number
(16 bits would be enough if I could track them, but I should use 32 if I want
blind "i.d.'s will never get reused" to work) which the per-track extent
indices would use, but it's ugly and takes up more memory.  Any ideas?
One possibility is to try and migrate a data extent, or failing that, the
inode with the fewest extents to patch up, but that's pretty baroque.)

For the advanced: figure out how to leave some faint hope for file undeletion
with this scheme.  I have some ideas, but they're not pretty.  Note that for
recovering from a trashed file system things are great, as the track index
lets you identify all the bits of a file and put them in order.

More questions: should I bother allocating on 1-byte boundaries?  16-byte
boundaries will work almost as well, and let me pack track number and
# of bytes into 16 bits.
-- 
	-Colin (uunet!microsof!w-colinp)