Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!linus!decvax!harpo!seismo!hao!hplabs!sri-unix!mike@brl-vgr
From: mike@brl-vgr@sri-unix.UUCP
Newsgroups: net.unix-wizards
Subject: Re:  nice
Message-ID: <4019@sri-arpa.UUCP>
Date: Mon, 15-Aug-83 02:58:56 EDT
Article-I.D.: sri-arpa.4019
Posted: Mon Aug 15 02:58:56 1983
Date-Received: Wed, 10-Aug-83 16:47:57 EDT
Lines: 225

From:      Mike Muuss <mike@brl-vgr>

This does not answer the exact question posed ("how does the 4.1
scheduler work"), but does discuss an alternative scheduler
for UNIX.  The system under discussion below is BRL/UNIX,
a "High-Performance" V6 derivative, which is still in use today
(and supports 25-30 simultaneous users on an 11/70, and
has TCP/IP support, etc).

It is my intention to implement a somewhat more sophisticated
version of this scheduler for 4.2 BSD.  People who would be interested
in hearing more about this project should contact me in early
September.

- - - - - - - - From the archives - - - - - - - - - - - - - - -

Date:     16 Jun 82 20:30:03-EDT (Wed)
From:     Michael Muuss <mike@brl-bmd>
To:       Harry at Brl-Bmd, Steve at Brl-Bmd, Earl at Brl-Bmd
cc:       Brian Harvey <BH@mit-ai>, Gurus at Brl-Bmd, Software at Brl-Bmd,
          BobS at Brl-Bmd, Rucker at Brl-Bmd
Subject:  Notes on the BRL/UNIX Scheduling Algorithm

Based upon a request for information by Earl, I have put together the
following discussion:

	The UNIX Scheduler that we use in BRL/UNIX is most definitely
NOT time-sliced.  It is a strict priority based scheduler, with
premption.  So, the question then boils down to how is the priority
computed.  Here is some information that I have written about this:

SCHEDULING

Taking first the case where there is no swapping, it is the task of the
scheduler to allocate the CPU to the process with the highest priority
on the run queue.  Whenever an event has occured that a process was 
waiting in the kernel for (ie waiting in psleep()), pwakeup() is called
and the process is placed on the run queue, with the priority specified
when psleep was called.  runrun is incremented, so whenever all the
current interrups have been processed, and any kernel execution
has finished, rather than returning the CPU to the user process,
a context switch occurs (magic in the machine assist).

Since the newly awakened process, by convention, has a negative (ie good)
priority, it will get the CPU (assuming no other processes with a higher
priority).  Whenever the system service is completed (sys-call is finished,
signal posted, whatever), and the process is to resume executing
in user mode, the priority is recomputed using setpri(), so that it
competes fairly with other processes.

SWAPPING

As currently implemented, the swapper exists as a totally separate
entity from the scheduler, and in fact is itself scheduled as a process.
The swapper is process zero, hand crafted by the kernel.
While total separation between swapping and scheduling is probably not
the best approach, it does work quite well, and makes the code much
easier to understand and maintain.  (Note that the routine sched() is
actually the SWAPPER, and that swtch() is the SCHEDULER.  Poor naming.)

The swap algorithm that we use (which is different in many of the
details from the BTL algorithms, but still is similar), goes like this:

IF there are processes swapped out, THEN consider what to do
ELSE wait for somebody to be swapped out (usually by expand() or
xnewproc()) and check again.

Get information on the hightest (best) priority process that is swapped
out.  If sufficient core is availible as things stand now, go right
ahead and swap him in, and go back to the beginning.

No free core.  We know the size of the process to be swapped in, so loop
through the process table looking for "easy core" -- that is, processes
which are in core, but are executing a sys-wait call, or are in
ptrace-stop state.  Swap out "easy core" processes until there are no
more of them, or we have swapped out as much space as the process
comming in requires.  Note that this approach ignores the problem of
memory fragmentation, and merely hopes that things will work out.
However, the estimation approach reduces the proc table search
from order N^2 (old way) to order N, which can be very helpful with
large proc tables.

So, after swapping out the estmated amount of "easy core", try to swap
the best swapped out process on the run queue in (because of the elapsed
time, this may no longer be the same process as in the first time
through).  The code above is repeated.

If "enough" easy core could not be found, then we seek the "worst"
process in core.  If the worst in core is still better than the best
on the swap device, forget it, and wait a while.  ALSO, if the "worst"
process in core has been loaded for less than two seconds, forget it.
This is the only anti-thrashing code that we use.

INTERACTIONS on MEMORY RICH Systems

Where the amount of memory on a system is sufficient to hold most of the
members of the run queue, the swapping algorithm above tends to work
quite well, clearing memory of "deadwood" process just sitting around in
sys-WAIT state, keeping memory utilization at around 75%, which seems about
ideal (there is usually space for forks and expands directly in core).  This
points out a lossage:  the memory allocator and expand() don't have the
ability to check for additional core just off the end of a process, so a
lot of needless core-shuffling happens.  This is on the list of things
to be fixed.  In the memory rich scenario, the scheduler controls the
allocation of the CPU, because it can (almost) always find something to run.

INTERACTIONS on MEMORY POOR Systems

When the memory requirements of the run queue greatly exceed availible
memory (a frequent problem on all but 11/44 and 11/70 systems), the
allocation of the CPU is controled primarily by the swapper.  The
scheduler will run whatever is in core, but keeping the right procs in
core is the real trick here.  Because the swapper is strictly priority
based, it will try to keep the "right" processes in core;  the only
difficult case is when there are several processes exhibiting identical
behavior;  they will tend to "round robin" at the same priority.  Having
the decay ratios for incore and onswap different is important here.

The 2-second anti-thrashing constant prevents the system from going into
a total thrash, causing an orderly "staging" of processes through
memory.  Furthermore, where ever possible, I try to arrange to have
swapping take place on either (a) a high transfer rate disk also used
for filesystems or (b) a moderate speed disk dedicated to swapping
(RK05, etc).  This strategy tends to induce some bandwidth limitations
on swapping (~10 swaps/sec for dedicated RK05) so that core shuffling is
contained.

It is important to note that (in my opinion) the behavior of "staging"
processes though memory is the "proper" approach to this problem, if you
agree with the premise that the intent of the system is to provide good
interactive response, regardless of what happens to "crunchers".  While
this does tend to keep swapping activity fairly high, it also provides
reasonable service to interactive processes, which must be distinguished
from traditional thrashing, where little useful work is accomplished
because of all the memory shuffling.

INTERACTIONS on MEMORY STARVED Systems

If there is not enough room for 2 goodly sized user processes in the
machine (quite possible for 18-bit address space machines with less than
full compliment), and large programs are being run, then there may not
be much hope, and the swapping proceedure degenerates into one of
"popping" procs in and out of core, one at a time, which is the best
that can be done.


In a subsequent letter, I'll discuss how the bounds for the three
scheduling parameters are derived (and what they are), and how they
affect scheduling, swapping, and the interaction between the two.

----------------------------

Briefly, the computation of priority depends on several factors:

core usage:  every 20 blocks (10Kb) in p_size of core costs 1 "nice" point.

block io:  every 16 blocks in p_iocnt costs 1 "nice" point.

User-supplied NICE factor:  1 "nice" point per point, up or down.

CPU Usage:  1 nice point for every 16 ticks (0.256 ms) in p_cpu.

DISCUSSION:  The above formula is complicated by the fact that p_iocnt
and p_cpu are not what they seem.  Rather than representing a measure of
instantaneous loading, they include a decaying "history" factor as well.

For each second that passes, the following computations are made for
every process in the system:

Every time the value in u.u_utime (User CPU usage, not inclusing
u.u_stime, which is the CPU cost of doing I/O) crosses a 1-minute
boundary, the p_nice value is irrevocably incremented by 1, to provide a
long-range decay of "Number Crunchers".

[Every clock tick (0.016 ms) of cpu time is recorded by incrementing
p_cpu by one, and the process in marked with the "SWTCHED" flag]

Every second the process is in core, the p_time field is incremented by
HZ to indicate core-residency time.  Value is peak-limited to 90 seconds.

Now, to compute the p_cpu "decay" factor, the proper RATE must be
selected, as follows:

	rate = INRATIO iff SWTCHED is set, ie the process has received
			CPU time in the previous second,
	rate = NSELRATIO if SWTCHED is not set but SLOAD is, ie the
			process is loaded, but not run,
	rate = OUTRATIO if SWTCHED is not set and SLOAD is not set,
			ie process is swapped out, and got no CPU time.

After selecting the proper rate, p_cpu = p_cpu * rate / 100;
where the rate is expressed as a percentage (ie between 0 and 100).
p_iocnt = p_iocnt * diskratio / 100;

So, the entire scheduler really depends on the settings of the four
parameters inratio, nselratio, outratio, and diskratio.  These factors
control the "memory" of the past "behavior" of every process.

On the BMD70, these factors are presently set to the following values:
inratio=88%, nselratio=70%, outratio=60%, diskratio=40%.  These values
may be interogated and changed dynamically by using the program "schp".

Some important relationships exist between these numbers;  for a
memory-rich system, inratio > nselratio > outratio, so that the memory
of behavior is better for the people in core getting time.  This tends
to give a share of the machine in the near future to those processes
which have not gotten time in the immediate past, helping to level the
load.

The setting of the diskratio value has not been studied much;  the
present setting was done mostly by intuition.  The current values of the
cpu time ratios were originally determined by a simulation of the
algorithm, and then modified slightly under actual loading tests.
I presented a short paper on this at the 1st Toronto USENIX conference;
I'll try to dig up all the details and send them on in (yet) another
message.

The current algorithm begins to have difficulties when the machine is
overcommitted by a factor of four or more, as processes which are marked
as "hogs" never get any CPU time at all.  It is not clear that the
scheduler should handle this case;  a bigger computer is a better
approach.

				Best,
				 -Mike