Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!linus!decvax!harpo!seismo!hao!hplabs!sri-unix!mike@brl-vgr From: mike@brl-vgr@sri-unix.UUCP Newsgroups: net.unix-wizards Subject: Re: nice Message-ID: <4019@sri-arpa.UUCP> Date: Mon, 15-Aug-83 02:58:56 EDT Article-I.D.: sri-arpa.4019 Posted: Mon Aug 15 02:58:56 1983 Date-Received: Wed, 10-Aug-83 16:47:57 EDT Lines: 225 From: Mike Muuss This does not answer the exact question posed ("how does the 4.1 scheduler work"), but does discuss an alternative scheduler for UNIX. The system under discussion below is BRL/UNIX, a "High-Performance" V6 derivative, which is still in use today (and supports 25-30 simultaneous users on an 11/70, and has TCP/IP support, etc). It is my intention to implement a somewhat more sophisticated version of this scheduler for 4.2 BSD. People who would be interested in hearing more about this project should contact me in early September. - - - - - - - - From the archives - - - - - - - - - - - - - - - Date: 16 Jun 82 20:30:03-EDT (Wed) From: Michael Muuss To: Harry at Brl-Bmd, Steve at Brl-Bmd, Earl at Brl-Bmd cc: Brian Harvey , Gurus at Brl-Bmd, Software at Brl-Bmd, BobS at Brl-Bmd, Rucker at Brl-Bmd Subject: Notes on the BRL/UNIX Scheduling Algorithm Based upon a request for information by Earl, I have put together the following discussion: The UNIX Scheduler that we use in BRL/UNIX is most definitely NOT time-sliced. It is a strict priority based scheduler, with premption. So, the question then boils down to how is the priority computed. Here is some information that I have written about this: SCHEDULING Taking first the case where there is no swapping, it is the task of the scheduler to allocate the CPU to the process with the highest priority on the run queue. Whenever an event has occured that a process was waiting in the kernel for (ie waiting in psleep()), pwakeup() is called and the process is placed on the run queue, with the priority specified when psleep was called. runrun is incremented, so whenever all the current interrups have been processed, and any kernel execution has finished, rather than returning the CPU to the user process, a context switch occurs (magic in the machine assist). Since the newly awakened process, by convention, has a negative (ie good) priority, it will get the CPU (assuming no other processes with a higher priority). Whenever the system service is completed (sys-call is finished, signal posted, whatever), and the process is to resume executing in user mode, the priority is recomputed using setpri(), so that it competes fairly with other processes. SWAPPING As currently implemented, the swapper exists as a totally separate entity from the scheduler, and in fact is itself scheduled as a process. The swapper is process zero, hand crafted by the kernel. While total separation between swapping and scheduling is probably not the best approach, it does work quite well, and makes the code much easier to understand and maintain. (Note that the routine sched() is actually the SWAPPER, and that swtch() is the SCHEDULER. Poor naming.) The swap algorithm that we use (which is different in many of the details from the BTL algorithms, but still is similar), goes like this: IF there are processes swapped out, THEN consider what to do ELSE wait for somebody to be swapped out (usually by expand() or xnewproc()) and check again. Get information on the hightest (best) priority process that is swapped out. If sufficient core is availible as things stand now, go right ahead and swap him in, and go back to the beginning. No free core. We know the size of the process to be swapped in, so loop through the process table looking for "easy core" -- that is, processes which are in core, but are executing a sys-wait call, or are in ptrace-stop state. Swap out "easy core" processes until there are no more of them, or we have swapped out as much space as the process comming in requires. Note that this approach ignores the problem of memory fragmentation, and merely hopes that things will work out. However, the estimation approach reduces the proc table search from order N^2 (old way) to order N, which can be very helpful with large proc tables. So, after swapping out the estmated amount of "easy core", try to swap the best swapped out process on the run queue in (because of the elapsed time, this may no longer be the same process as in the first time through). The code above is repeated. If "enough" easy core could not be found, then we seek the "worst" process in core. If the worst in core is still better than the best on the swap device, forget it, and wait a while. ALSO, if the "worst" process in core has been loaded for less than two seconds, forget it. This is the only anti-thrashing code that we use. INTERACTIONS on MEMORY RICH Systems Where the amount of memory on a system is sufficient to hold most of the members of the run queue, the swapping algorithm above tends to work quite well, clearing memory of "deadwood" process just sitting around in sys-WAIT state, keeping memory utilization at around 75%, which seems about ideal (there is usually space for forks and expands directly in core). This points out a lossage: the memory allocator and expand() don't have the ability to check for additional core just off the end of a process, so a lot of needless core-shuffling happens. This is on the list of things to be fixed. In the memory rich scenario, the scheduler controls the allocation of the CPU, because it can (almost) always find something to run. INTERACTIONS on MEMORY POOR Systems When the memory requirements of the run queue greatly exceed availible memory (a frequent problem on all but 11/44 and 11/70 systems), the allocation of the CPU is controled primarily by the swapper. The scheduler will run whatever is in core, but keeping the right procs in core is the real trick here. Because the swapper is strictly priority based, it will try to keep the "right" processes in core; the only difficult case is when there are several processes exhibiting identical behavior; they will tend to "round robin" at the same priority. Having the decay ratios for incore and onswap different is important here. The 2-second anti-thrashing constant prevents the system from going into a total thrash, causing an orderly "staging" of processes through memory. Furthermore, where ever possible, I try to arrange to have swapping take place on either (a) a high transfer rate disk also used for filesystems or (b) a moderate speed disk dedicated to swapping (RK05, etc). This strategy tends to induce some bandwidth limitations on swapping (~10 swaps/sec for dedicated RK05) so that core shuffling is contained. It is important to note that (in my opinion) the behavior of "staging" processes though memory is the "proper" approach to this problem, if you agree with the premise that the intent of the system is to provide good interactive response, regardless of what happens to "crunchers". While this does tend to keep swapping activity fairly high, it also provides reasonable service to interactive processes, which must be distinguished from traditional thrashing, where little useful work is accomplished because of all the memory shuffling. INTERACTIONS on MEMORY STARVED Systems If there is not enough room for 2 goodly sized user processes in the machine (quite possible for 18-bit address space machines with less than full compliment), and large programs are being run, then there may not be much hope, and the swapping proceedure degenerates into one of "popping" procs in and out of core, one at a time, which is the best that can be done. In a subsequent letter, I'll discuss how the bounds for the three scheduling parameters are derived (and what they are), and how they affect scheduling, swapping, and the interaction between the two. ---------------------------- Briefly, the computation of priority depends on several factors: core usage: every 20 blocks (10Kb) in p_size of core costs 1 "nice" point. block io: every 16 blocks in p_iocnt costs 1 "nice" point. User-supplied NICE factor: 1 "nice" point per point, up or down. CPU Usage: 1 nice point for every 16 ticks (0.256 ms) in p_cpu. DISCUSSION: The above formula is complicated by the fact that p_iocnt and p_cpu are not what they seem. Rather than representing a measure of instantaneous loading, they include a decaying "history" factor as well. For each second that passes, the following computations are made for every process in the system: Every time the value in u.u_utime (User CPU usage, not inclusing u.u_stime, which is the CPU cost of doing I/O) crosses a 1-minute boundary, the p_nice value is irrevocably incremented by 1, to provide a long-range decay of "Number Crunchers". [Every clock tick (0.016 ms) of cpu time is recorded by incrementing p_cpu by one, and the process in marked with the "SWTCHED" flag] Every second the process is in core, the p_time field is incremented by HZ to indicate core-residency time. Value is peak-limited to 90 seconds. Now, to compute the p_cpu "decay" factor, the proper RATE must be selected, as follows: rate = INRATIO iff SWTCHED is set, ie the process has received CPU time in the previous second, rate = NSELRATIO if SWTCHED is not set but SLOAD is, ie the process is loaded, but not run, rate = OUTRATIO if SWTCHED is not set and SLOAD is not set, ie process is swapped out, and got no CPU time. After selecting the proper rate, p_cpu = p_cpu * rate / 100; where the rate is expressed as a percentage (ie between 0 and 100). p_iocnt = p_iocnt * diskratio / 100; So, the entire scheduler really depends on the settings of the four parameters inratio, nselratio, outratio, and diskratio. These factors control the "memory" of the past "behavior" of every process. On the BMD70, these factors are presently set to the following values: inratio=88%, nselratio=70%, outratio=60%, diskratio=40%. These values may be interogated and changed dynamically by using the program "schp". Some important relationships exist between these numbers; for a memory-rich system, inratio > nselratio > outratio, so that the memory of behavior is better for the people in core getting time. This tends to give a share of the machine in the near future to those processes which have not gotten time in the immediate past, helping to level the load. The setting of the diskratio value has not been studied much; the present setting was done mostly by intuition. The current values of the cpu time ratios were originally determined by a simulation of the algorithm, and then modified slightly under actual loading tests. I presented a short paper on this at the 1st Toronto USENIX conference; I'll try to dig up all the details and send them on in (yet) another message. The current algorithm begins to have difficulties when the machine is overcommitted by a factor of four or more, as processes which are marked as "hogs" never get any CPU time at all. It is not clear that the scheduler should handle this case; a bigger computer is a better approach. Best, -Mike