Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!lll-lcc!styx!ames!ucbcad!ucbvax!sdcsvax!darrell
From: darrell@sdcsvax.UUCP
Newsgroups: mod.os
Subject: Who needs files. [Really "Apollo's been doing it for 6 years"]
Message-ID: <2906@sdcsvax.UCSD.EDU>
Date: Mon, 23-Mar-87 12:51:51 EST
Article-I.D.: sdcsvax.2906
Posted: Mon Mar 23 12:51:51 1987
Date-Received: Sat, 28-Mar-87 05:07:31 EST
Sender: darrell@sdcsvax.UCSD.EDU
Lines: 61
Approved: mod-os@sdcsvax.uucp

Jim Rees's article (which may or may not have been posted yet) gives
the outlines of how Apollo uses the single-level store idea to implement
its distributed filesystem.

Something else to note is that the mapped I/O model did cause
performance problems at the sequential file access level.  The kernel
knows nothing about sequential I/O and open files
and such and originally did not do any read-ahead.  Later (a while ago
now), we added "touch ahead":  Mapped segments (32K regions of virtual
address space) can be marked with an integer that is the number of pages
the pager should read in when any page in the segment gets faulted on.
This gives read-ahead of a sort.
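
To make the effect of touch-ahead concrete, here is a toy model (not
Apollo's pager).  The names Segment, touch, and sequential_scan are made
up for illustration.  The sketch assumes 1K pages, so that a 32K segment
holds 32 pages, and assumes the pager reads forward from the faulted
page; the description above specifies neither.

```python
PAGES_PER_SEGMENT = 32  # assumption: 32K segment of 1K pages


class Segment:
    """Toy model of a mapped segment with a per-segment touch-ahead count."""

    def __init__(self, touch_ahead=1):
        self.touch_ahead = touch_ahead  # pages to read in on each fault
        self.resident = set()           # page numbers currently in memory
        self.faults = 0                 # page faults taken so far

    def touch(self, page):
        """Reference one page; on a fault, read in touch_ahead pages."""
        if page in self.resident:
            return
        self.faults += 1
        # Assumption: read forward from the faulted page, clipped to the segment.
        last = min(page + self.touch_ahead, PAGES_PER_SEGMENT)
        self.resident.update(range(page, last))


def sequential_scan(touch_ahead):
    """Touch every page of a segment in order and count the faults."""
    seg = Segment(touch_ahead)
    for page in range(PAGES_PER_SEGMENT):
        seg.touch(page)
    return seg.faults
```

In this model a sequential scan with a touch-ahead count of 1 faults on
every page, while a count of 8 faults only once per 8 pages.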

Sequential output was a bit trickier.  There is no parallel to page
faulting that occurs when you're "done" writing through a page.  What
we did was allow segments to be marked "flush-behind";
i.e. the physical memory pages should be treated as good candidates for
re-use as soon as the segment mapped over them gets unmapped from the
virtual address space.  Another change we made to make sequential output
better is "grow-ahead":  Note that when you're writing a new file, the
pages that are mapped in do not correspond to real disk pages until they
are touched.  Touching one of these pages is called a "growth fault".
The system now optionally grows the file by more than one page on growth
faults.  The Stream I/O library takes care of truncating off any extra
pages when the stream is closed.
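
A rough sketch of grow-ahead plus truncate-on-close, again as a toy
model and not the real Stream I/O library.  GrowingFile, write_page, and
close are hypothetical names, and the bookkeeping is in whole pages for
simplicity.

```python
class GrowingFile:
    """Toy model of a new file being written through a mapping."""

    def __init__(self, grow_ahead=1):
        self.grow_ahead = grow_ahead  # pages allocated per growth fault
        self.allocated = 0            # pages that have real disk pages
        self.written = 0              # pages the stream actually wrote
        self.growth_faults = 0

    def write_page(self, page):
        """Touch a page for writing; take a growth fault if it has no disk page."""
        if page >= self.allocated:
            self.growth_faults += 1
            self.allocated = page + self.grow_ahead
        self.written = max(self.written, page + 1)

    def close(self):
        """Truncate pages that were grown ahead but never written."""
        extra = self.allocated - self.written
        self.allocated = self.written
        return extra
```

Writing 10 pages with grow_ahead=4 takes 3 growth faults in this model
instead of 10, and close() trims the 2 pages that were grown but never
used.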

All in all, mapped I/O is a nice and useful idea, but anyone who thinks
that it will perform just like a traditional sequential I/O system withOUT
any special purpose features like the ones described above is whistling
in the dark.  Also, some of the features don't always work like you'd
think/hope.

To deal with cache consistency, the file locking mechanism is tied into
the remote file caching mechanism.  Although it is possible to bypass
the locking mechanism and map a file withOUT locking, this is strongly
discouraged (i.e. not a documented feature and not used by the Stream
I/O library).  When a node locks a file, it contacts the home node of
the file and gets back the current date-time-modified (DTM) for the file.
It uses this value to determine whether any pages the using node has are
still OK (i.e. whether it can avoid re-reading the pages from the home
node).  When the using node unlocks the file, dirty pages are sent back
to the home node before the lock is released.  If you use the mechanisms
as intended, you NEVER get bad (stale) data.  We consider this property
a necessity.
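
The lock/unlock protocol can be sketched roughly as follows.  This is a
simplified model, not Apollo's code:  HomeNode and UsingNode are made-up
names, the DTM is a plain counter, there is only one kind of lock
(exclusive), and the rule that a node may keep its cache after writing
back its own dirty pages is an assumption.

```python
class HomeNode:
    """Holds the authoritative copy of one file and its DTM."""

    def __init__(self, pages):
        self.pages = dict(pages)  # page number -> contents
        self.dtm = 0              # stand-in for date-time-modified
        self.locked_by = None

    def lock(self, node):
        if self.locked_by is not None:
            raise RuntimeError("file is already locked")
        self.locked_by = node
        return self.dtm           # the locker gets the current DTM back

    def unlock(self, node, dirty_pages):
        if self.locked_by is not node:
            raise RuntimeError("file is not locked by this node")
        if dirty_pages:           # dirty pages arrive before the release
            self.pages.update(dirty_pages)
            self.dtm += 1
        self.locked_by = None


class UsingNode:
    """A node that caches pages of a file whose home is elsewhere."""

    def __init__(self, home):
        self.home = home
        self.cache = {}
        self.cached_dtm = None
        self.dirty = {}
        self.remote_reads = 0

    def lock(self):
        dtm = self.home.lock(self)
        if dtm != self.cached_dtm:   # file changed: cached pages are stale
            self.cache.clear()
            self.cached_dtm = dtm

    def read(self, page):
        if page not in self.cache:   # only now go back to the home node
            self.remote_reads += 1
            self.cache[page] = self.home.pages[page]
        return self.cache[page]

    def write(self, page, data):
        self.cache[page] = data
        self.dirty[page] = data

    def unlock(self):
        self.home.unlock(self, self.dirty)
        if self.dirty:
            # Assumption: our cache matches what we just wrote back.
            self.cached_dtm = self.home.dtm
        self.dirty = {}
```

In this model a node that re-locks an unchanged file re-reads nothing,
while a node that re-locks after someone else wrote discards its cached
pages and fetches fresh ones.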

Another thing to remember about the single-level store:  Don't be seduced,
i.e. don't use it for things for which it is not intended.  Just
because you can access some database file all over the network doesn't mean
that you should implement a DBMS this way.  You probably really want
to use RPC.  The problem with using a file system in the "wrong" way
is that the interface to file systems is just generally not designed
to deal with failure -- e.g. to let you know just how much of your
I/O succeeded when the network partitioned.  Things can be made even worse
in a mapped I/O system since any random memory reference can cause an
exception to be raised.  (At least when you make "normal" filesystem
calls, you generally get an error returned.  You can simulate this with
mapped I/O, but not always cleanly and faithfully.)
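
As an illustration of the "simulate this" remark, here is a toy wrapper
that turns a fault on a mapped reference into a read()-style partial
result.  Everything in it is hypothetical:  RemoteFault, MappedFile, and
read_call are invented names, and a real mapped reference faults in
hardware rather than in a Python method.

```python
class RemoteFault(Exception):
    """Hypothetical exception raised when a mapped page cannot be fetched."""


class MappedFile:
    """Toy mapping: any byte reference may fault if its page is unreachable."""

    def __init__(self, data, page_size=4, unreachable=()):
        self.data = data
        self.page_size = page_size
        self.unreachable = set(unreachable)  # pages lost to a partition

    def __getitem__(self, offset):
        if offset // self.page_size in self.unreachable:
            raise RemoteFault(offset)
        return self.data[offset]


def read_call(mapped, offset, count):
    """Copy up to count bytes; return (data, error) instead of raising."""
    out = bytearray()
    try:
        for i in range(offset, offset + count):
            out.append(mapped[i])
    except RemoteFault as fault:
        return bytes(out), fault   # caller learns how much succeeded
    return bytes(out), None
```

The wrapper only helps code that goes through it.  Any other reference
into the mapping still raises wherever it happens to occur, which is the
part that is hard to simulate cleanly.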

                -- Nat Mishkin
                   Apollo Computer Inc.
-------