Xref: utzoo comp.protocols.nfs:1901 comp.arch:21303
Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!uunet!stanford.edu!rutgers!bellcore!messy!mo
From: mo@messy.bellcore.com (Michael O'Dell)
Newsgroups: comp.protocols.nfs,comp.arch
Subject: Re: Incremental sync()s and using disk idle time
Message-ID: <1991Mar8.142031.9098@bellcore.bellcore.com>
Date: 8 Mar 91 14:20:31 GMT
References: <28975@cs.yale.edu> <PCG.91Feb28195347@odin.cs.aber.ac.uk> <PCG.91Mar4210505@aberdb.cs.aber.ac.uk> <1991Mar5.223443.21187@ns.uoregon.edu> <1991Mar6.003008.9131@bellcore.bellcore.com> <1991Mar7.115154.4820@hq.demos.su>
Sender: usenet@bellcore.bellcore.com (Poster of News)
Reply-To: mo@messy.bellcore.com (Michael O'Dell)
Organization: Center for Chaotic Repeatabilty
Lines: 81

Line-eater fodder.

"laser-guided line-eater" fodder.
 
Incremental sync() is a good idea in some ways, but not all.

When George Goble did the famous dual-headed vax with 32 megabytes of
memory, one of the first things noticed was that once every 30 seconds,
things got very slow as update flushed out memory.  One of the things
he did was to put a clock-hand style thing in the kernel so the equivalent
of update could push out the pages in a slow, steady stream, instead of
the gigantic clumps of dirty disk blocks.

However, the assumption that there is any disk idle time is basically wrong.
On large, heavily-loaded systems, the disks run pretty constantly.
This isn't to say that drizzling things out at a controlled rate rather
than in big lumps isn't useful, but sometimes it doesn't help, either.
One real problem is that the mapped file may have semantics which require
the user program to not terminate until the write to disk has finished
because the program wants to be quite certain the data is out there.
And if the references to the file were quite random (like a large database
hash table index), then there's a good chance that incremental page pushing
did NOT clean some substantive fraction of the dirty pages, still producing
the impulse load on the disk queue.

One important observation is that no VM system can run faster than it can
get clean pages - if there is an adquate supply in memory, then fine.
Otherwise, how quickly you can turn the pages to disk is the limiting factor
for overall throughput (jobs completed per hour, or conversely, time for
a single large VM-bound job).  This factor most directly effects the
level of multiprogramming a given system can sustain, assuming the workload
isn't all just small jobs which trivially fit in memory and do no substantial
I/O. (IBM MVS I/O systems are often tuned to maximize the sustainable
paging rate without Trashing. Think about it.)

One serious elevator anomoly in fast machines is as follows.  Assume several
processes trying to read different files, each one broken into several
reasonably large chunks (could easily be extents, so that doesn't fix it).
Further, assume that one process's file is broken into several more chunks
than the others with these smaller chunks spread over the same distance
as the other files.  Finally, assume the machine is enough faster than the
disks that each process can complete its processing BEFORE the heads
can seek past the next request.  So, using the standard elevator which
resorts the queue on every request, the process with the large number
of small extents (fortuitously layed-out in request order) will completely
monopolize the disk!  Because the standard elevator will let you "get back
into the boat" even after it's left, the process gets its data, and
spins out another request before the head finishes another request in the
same neighborhood, so it zooms to the front of the elevator queue and
gets to do another read.  The poor processes with their blocks toward the
end of the run starve to death by comparsion. (Note we are assuming a
one-way elevator; it's worse with a 2-way.)  What do you do?  One scheme
used successfully is "generation scheduling".  This collects an elevator-load
of requests, sorts them, and then services them WITHOUT admitting
anyone to the car along the way.  This is a way of insuring fairness.
It also turns out that this scheme can be modified to eleviate SOME of
the problems with the big memory flush problem.  Getting the details
right is complex, but the general approach is as follows.  There is a
"generation counter" which increments at some rate like 2-5x a request
service time.  Each request is marked with the generation when it
arrives. Further, you use a 2-dimensional queue, sorting the subqueues
by generation and within the subqueue keeping the original FIFO arrival order.
(There is some discussion about sorting the sub-queues. jury is still out.)
You now load the elevator car across the subqueues, always getting at least
one request from every generation pending, more with some weighting
function like queue length. [You can implement "must do in this order"
requests by including a special generation queue which always loads the
car first and is not sorted in the car.]   A further enhancement is
to add priorities to requests.  FOr instance, LOW, MEDIUM, and HIGH,
plus FIFO looks good.  LOW is page turning when the pager isn't frantic
for clean pages, MEDIUM for normal reads and writes which you want to
complete soon, and HIGH for things like directory searches in namei()
and some metadata updates, and FIFO for critical metadata updates.
Couple this with the full generational scheduling described above
and you will go a long way toward making the low-level stuff stable
in  the face of large (and truly unavoidable) impulse loads....

Oh yes - much of the thinking about this low-level scheduling stuff
came from Bill Fuller, Brian Berliner, and Tony Schene, then of Prisma.

	-Mike O'Dell