Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Posting-Version: version B 2.10.1 6/24/83; site utcs.uucp Path: utzoo!utcs!wagner From: wagner@utcs.uucp (Michael Wagner) Newsgroups: net.unix-wizards Subject: Re: 4.2bsd kernel auto-nicing, scheduling Message-ID: <1142@utcs.uucp> Date: Thu, 6-Mar-86 11:33:56 EST Article-I.D.: utcs.1142 Posted: Thu Mar 6 11:33:56 1986 Date-Received: Thu, 6-Mar-86 11:40:54 EST References: <1014@brl-smoke.ARPA> <856@inset.UUCP> <3311@sun.uucp> Reply-To: wagner@utcs.UUCP (Michael Wagner) Organization: University of Toronto Computing Services, general purpose UNIX Lines: 81 Summary: I've just been catching up on unix-wizards, which I haven't looked at in a loooong time, and I found this discussion of 4.2 scheduling. Interestingly, the Maryland stuff sounds *very* close to the (rather heuristic) scheduler in CP (the 'monitor', if you will) of VM. It (basically) has: 1) a starting priority for every user 2) no explicit upper/lower limits per user, but certain other constructs generate something like a probability cloud around that starting priority 3) a number of queues that users reside on. Aside from the ones that designate resource unavailability (notably storage overcommitment), there are basically 3 queues for 'runable' users, called (with typical lack of imagination) Q1, Q2 and Q3. They correspond to very interactive, somewhat interactive, and batch (by batch I mean it runs to completion once started, not that you send it off elsewhere). In the HPO (High Performance Option) variant of the product, the issue is clouded (but vastly improved, I'm told...I'll shortly be able to report on this) by the splitting of each of these into waiting-for-CPU-dispatch and waiting-for-other-quick-services (page fault, etc) queues. They are informally called the real-run-list and the run-list, respectively. I can't recall the formal names right now. Interrupts move users between the run-list and the real-run-list, and only the real-run-list need be scanned on most redispatches. 4) A transaction can be loosely defined as the time between pressing the enter key, and getting the last of the results. (for those who don't know, S/370 moves terminal interaction (irrevocably) into front-end hardware, so there is no raw-mode equivalent. The entire input is presented to the mainframe in one interrupt when it is completed. Completion is signalled by a special key (enter). Some (limited) local editing is possible before hitting enter.) When a transaction is started, it is given a 'deadline' based on the basic priority of the user, some acknowledgement of the load average (called the expansion factor in CP), and an initial (recent-history-based, I think) characterization of the transaction. After that, the run lists are sorted and serviced in 'deadline' order. This effectively prevents indefinite postponement (but, as pointed out in earlier postings, indefinite postponement strategies are rendered ineffective when faced with inadequate hardware). 5) While a transaction is 'runable', it is thrown into the pool and gets time-slices in rough proportion to it's deadline priority. The short- term scheduler strategy is a modified round-robin (I think someone once called it a round-turkey :-) ). If a transaction sucks up enough resource, the scheduler decides it was wrong to characterize this transaction as interactive, and it moves it to Q2 (which also has the effect of recalculating it's deadline, because it has now shown itself to be less interactive). A similar shift can occur Q2->Q3. Living in 'lower' queues has the effect of getting you larger slices of resource, but less frequently. There's more, but you get the point (and this is getting long). One of the things I found amazing (coming from a background with more sophisticated schedulers) was how well this all works, in the face of tremendous loads. We used to run 3 large UTS systems (a 370 UNIX-like system) and a smattering of little CMS users (mostly development and maintenance) on our system. Even when the machine ran flat out, editing response for CMS users was uneffected by the load. (There were other problems, but they basically weren't CP's fault, and are really another story). As another example, we recently disconnected one channel path to DASD (in order to provide a test path for a new system being build concurrently with production). That cuts our access to disk in half. No one seems to notice in their editing response. I can see it in the performance reports, but it's minimal. I can also now flog the system with artificially constructed workloads and actually effect other users (something I basically couldn't do when both channels were connected). I think we're seeing the effect of inadequate hardware there, though, and not bad scheduling decisions. One place where this all falls down is that the scheduler basically makes only CPU dispatching decisions. It has no influence on I/O queue decisions, nor paging queue decisions. This all works if you have overdesigned your I/O configuration relative to the rest of the machine, but would fail badly otherwise. This is relatively easy to do with 370 arch. I/O gear, but (seemingly) much harder on VAXEN. I'm curious how this is compensated for in scheduling for 4.2 Well, enough for now. I don't know how interested people on this list are in hearing about work being done on other systems (especially blue ones :-)). Michael Wagner (wagner@utcs)