Xref: utzoo comp.protocols.nfs:1901 comp.arch:21303 Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!uunet!stanford.edu!rutgers!bellcore!messy!mo From: mo@messy.bellcore.com (Michael O'Dell) Newsgroups: comp.protocols.nfs,comp.arch Subject: Re: Incremental sync()s and using disk idle time Message-ID: <1991Mar8.142031.9098@bellcore.bellcore.com> Date: 8 Mar 91 14:20:31 GMT References: <28975@cs.yale.edu> <1991Mar5.223443.21187@ns.uoregon.edu> <1991Mar6.003008.9131@bellcore.bellcore.com> <1991Mar7.115154.4820@hq.demos.su> Sender: usenet@bellcore.bellcore.com (Poster of News) Reply-To: mo@messy.bellcore.com (Michael O'Dell) Organization: Center for Chaotic Repeatabilty Lines: 81 Line-eater fodder. "laser-guided line-eater" fodder. Incremental sync() is a good idea in some ways, but not all. When George Goble did the famous dual-headed vax with 32 megabytes of memory, one of the first things noticed was that once every 30 seconds, things got very slow as update flushed out memory. One of the things he did was to put a clock-hand style thing in the kernel so the equivalent of update could push out the pages in a slow, steady stream, instead of the gigantic clumps of dirty disk blocks. However, the assumption that there is any disk idle time is basically wrong. On large, heavily-loaded systems, the disks run pretty constantly. This isn't to say that drizzling things out at a controlled rate rather than in big lumps isn't useful, but sometimes it doesn't help, either. One real problem is that the mapped file may have semantics which require the user program to not terminate until the write to disk has finished because the program wants to be quite certain the data is out there. And if the references to the file were quite random (like a large database hash table index), then there's a good chance that incremental page pushing did NOT clean some substantive fraction of the dirty pages, still producing the impulse load on the disk queue. One important observation is that no VM system can run faster than it can get clean pages - if there is an adquate supply in memory, then fine. Otherwise, how quickly you can turn the pages to disk is the limiting factor for overall throughput (jobs completed per hour, or conversely, time for a single large VM-bound job). This factor most directly effects the level of multiprogramming a given system can sustain, assuming the workload isn't all just small jobs which trivially fit in memory and do no substantial I/O. (IBM MVS I/O systems are often tuned to maximize the sustainable paging rate without Trashing. Think about it.) One serious elevator anomoly in fast machines is as follows. Assume several processes trying to read different files, each one broken into several reasonably large chunks (could easily be extents, so that doesn't fix it). Further, assume that one process's file is broken into several more chunks than the others with these smaller chunks spread over the same distance as the other files. Finally, assume the machine is enough faster than the disks that each process can complete its processing BEFORE the heads can seek past the next request. So, using the standard elevator which resorts the queue on every request, the process with the large number of small extents (fortuitously layed-out in request order) will completely monopolize the disk! Because the standard elevator will let you "get back into the boat" even after it's left, the process gets its data, and spins out another request before the head finishes another request in the same neighborhood, so it zooms to the front of the elevator queue and gets to do another read. The poor processes with their blocks toward the end of the run starve to death by comparsion. (Note we are assuming a one-way elevator; it's worse with a 2-way.) What do you do? One scheme used successfully is "generation scheduling". This collects an elevator-load of requests, sorts them, and then services them WITHOUT admitting anyone to the car along the way. This is a way of insuring fairness. It also turns out that this scheme can be modified to eleviate SOME of the problems with the big memory flush problem. Getting the details right is complex, but the general approach is as follows. There is a "generation counter" which increments at some rate like 2-5x a request service time. Each request is marked with the generation when it arrives. Further, you use a 2-dimensional queue, sorting the subqueues by generation and within the subqueue keeping the original FIFO arrival order. (There is some discussion about sorting the sub-queues. jury is still out.) You now load the elevator car across the subqueues, always getting at least one request from every generation pending, more with some weighting function like queue length. [You can implement "must do in this order" requests by including a special generation queue which always loads the car first and is not sorted in the car.] A further enhancement is to add priorities to requests. FOr instance, LOW, MEDIUM, and HIGH, plus FIFO looks good. LOW is page turning when the pager isn't frantic for clean pages, MEDIUM for normal reads and writes which you want to complete soon, and HIGH for things like directory searches in namei() and some metadata updates, and FIFO for critical metadata updates. Couple this with the full generational scheduling described above and you will go a long way toward making the low-level stuff stable in the face of large (and truly unavoidable) impulse loads.... Oh yes - much of the thinking about this low-level scheduling stuff came from Bill Fuller, Brian Berliner, and Tony Schene, then of Prisma. -Mike O'Dell