From: chris@mimsy.umd.edu (Chris Torek)
Newsgroups: comp.arch
Subject: fork, spawn, vfork: an overview
Message-ID: <25904@mimsy.umd.edu>
Date: 5 Aug 90 23:39:07 GMT
References: <920@dgis.dtic.dla.mil> <5830@titcce.cc.titech.ac.jp> <1990Jul24.194313.3258@esegue.segue.boston.ma.us>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742

Since there seems to be a great deal of confusion and ignorance
surrounding the `vfork issue', it might help if I describe how these
things work.

The basic idea behind `fork' is quite simple:  make an exact duplicate
of the running process---copy all the user information and all the
kernel information, except of course the process ID and special
internal-only implementation data---and then return from the fork in
the old process giving the new process's ID, and return from the fork
in the new process giving a 0 (so that it can tell it is the clone
rather than the original).
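
As a minimal sketch of the two return values just described (the
function name fork_demo and the use of an exit code to report back are
assumptions made for illustration, not anything from a particular
program):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * fork_demo (hypothetical name): fork once, have the clone exit with
 * child_code, and return the exit status the original collects, or -1
 * on failure.
 */
int fork_demo(int child_code)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;		/* fork failed; there is no clone */
	if (pid == 0) {
		/* fork returned 0: this is the clone */
		_exit(child_code);
	}
	/* fork returned the clone's process ID: this is the original */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling fork_demo(42) from the original process should hand back 42,
the value the clone passed to _exit().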

This operation tends to be expensive since it involves copying data,
potentially a great deal of data.  There are a number of solutions:

 a. Avoid fork entirely (the spawn issue):  `If the copy is merely
    being made so that the clone can throw away its user data and run
    another program, why bother copying?'  This is a sensible idea, but
    every time someone tries to do it, the spawn call winds up being
    terribly complex.  Perhaps this is because the wrong people do it;
    but perhaps it is because the spawn approach is inherently flawed.
    I am not going to argue one way or the other here.

 b. Try not to copy.  Anything that can never be changed (e.g., the
    program's code or `text' section) obviously need not be physically
    copied.  (Unix fork() calls have always shared text.)  In addition,
    if the hardware supports it, the data and stack segments need not
    be copied up front either:  they can be copied lazily, a page at a
    time, via copy-on-write or copy-on-reference.
    This is not always the best policy; sometimes the expense of
    copying on write or reference outweighs the savings from never
    copying some pages.  There are some tricks that can be applied
    here as well (e.g., always copy the current stack page, leave the
    rest copy-on-write).

 c. Cheat.  This is the `vfork' approach.
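
Whichever of the copying strategies in (b) a kernel picks, a program
using fork should not be able to tell the difference:  a store made by
the clone lands in the clone's copy only.  A small sketch of that
guarantee follows; the names counter and fork_private_data are
hypothetical, chosen for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 1;		/* lives in the data segment */

/*
 * fork_private_data (hypothetical name): the clone changes `counter'
 * and exits; the original returns its own value, which is unaffected
 * whether the kernel copied the page eagerly or on first write.
 */
int fork_private_data(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		counter = 99;	/* touches only the clone's copy */
		_exit(counter == 99 ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;
	return counter;		/* still 1 in the original */
}
```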

Even if fork can be done via copy-on-write, on some machines it remains
expensive for large processes, because the page tables must be copied
anyway and these can also be large.  (Some of the pages of PTEs can be
shared, but this rapidly gets tricky.)  And of course, there are
machines on which copy-on-write cannot be done, either because the
hardware never supported it (e.g., where writes cannot be restarted) or
because of a bug in the microcode (e.g., the infamous VAX-11/750 bug);
here copy-on-reference can be used, but it tends to be more expensive
(most processes read many more locations than they write).  Thus, the
`virtual fork' trick.

Suppose that, instead of making a copy of the data and stack segments
*and* the kernel per-process information, the kernel were to make a
copy of just the kernel per-process information, then suspend the
parent process entirely, `handing over' the entire virtual memory image
to the clone (by giving the original's PTE pages as well as its data
and stack segments to the clone)?  The clone would then be running in
the very same virtual (and physical) memory that the original had
before it `virtually forked'.  Nothing has been copied except the
kernel data.  The main issue here is that, if the original process
were allowed to run, both the parent and the child would be competing
for the same data and stack memory---much like threads, but disastrous
since the stacks would rapidly collide---so the parent must be marked
as having no memory at all, and cannot run while the child uses the
space.  The child, too, must be marked specially, so that when it is
ready to give up its virtual space (via exec() or exit()), the kernel
knows instead to `hand back' the entire virtual memory to the parent.

This is how vfork works.  Only the kernel data (the u. area) are copied.
The old virtual memory is not so much `shared' as `handed over' to the
child: it uses the exact same resources the parent process had.  When
the child process is done with them, they are `handed back' (using the
same kernel routine, in fact; it is called `vpassvm') and the parent is
allowed to continue.  There are several problems with this scheme:

 - By the time the parent resumes, the child process (the clone) has
   already returned from the vfork() system call on the very same
   stack, so the stack frame the parent needs for its own return has
   been wiped out.  This is handled by avoiding the return in the
   first place:  The vfork call is written in assembly, and pops the
   stack frame before doing the system call at all.  On return from
   the system call the assembly code jumps to the place a return would
   have gone, having pre-set the registers as appropriate.

 - Since the old VM is handed over to the child, any changes the child
   makes show up in the parent as well.  Occasionally some software
   winds up depending on this (e.g., csh).

 - Since the parent is suspended, the order of execution is defined:
   the child runs first, up to its exec() or exit(), and only then
   does the parent continue.  Occasionally some software winds up
   depending on this as well (again, csh is the prime example).

 - Although the kernel data are copied (so that changes made in the
   child to, e.g., the signal dispositions are not reflected in the
   parent), some of these describe the virtual memory resources.  When
   the child hands the virtual memory back to the parent, the parent
   process must account for any changes the child has made to the
   virtual memory itself---for instance, if the child has made the heap
   or stack larger, the parent must update its size information.  This
   is a good place for bugs to hide.
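
The second problem in the list (changes made by the child showing up
in the parent) can be probed with a sketch like the one below.  Treat
it as an experiment, not a technique:  POSIX leaves a store like this
in a vfork child undefined, and a kernel that implements vfork as a
true fork will report 0 instead of 1.  The names touched and
vfork_shares_memory are hypothetical.

```c
/* Asks glibc to declare vfork even under strict -std= modes. */
#define _DEFAULT_SOURCE 1
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile int touched;	/* 0 until some process stores to it */

/*
 * vfork_shares_memory (hypothetical name): returns 1 if the child's
 * store is visible in the parent (the memory was handed over), 0 if
 * it is not (vfork behaved like fork), and -1 on error.
 */
int vfork_shares_memory(void)
{
	int status;
	pid_t pid;

	touched = 0;
	pid = vfork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		touched = 1;	/* formally undefined; an experiment only */
		_exit(0);
	}
	/* with a true vfork, the child has already reached _exit here */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return touched;
}
```

Software that relies on seeing 1 here is depending on exactly the
side effect described above.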

Buggy user programs aside (those that depend on the side effects
above), it is always possible for a kernel to implement vfork as a
true fork, if this is somehow cheaper.  Thus there can never be an
efficiency argument in favor of fork; vfork need never cost more.
Moreover, even given a copy-on-write fork, vfork may still be
cheaper, because vfork does not have to copy PTEs.  On a machine like
the Butterfly, a vfork helps because the parent process is suspended:
the child can run in the parent's space until it exits or, more likely,
gets another processor to do an exec.  In essence, vfork is `fork with
a promise to finish soon' or even `spawn, but let me do a few things
before I go'.

When you come right down to it, the real disadvantage to vfork is that,
conceptually, it is downright ugly.  Viewing it as `spawn, but let me
run a few instructions first' helps a lot here.  The meaning of

	pid = vfork();
	if (pid == 0) {
		(void) dup2(fd, 0);
		(void) close(fd);
		(void) execve(path, av, env);
		_exit(1);
	}
	...

is not `make a copy; if this is the child, fiddle with descriptors
and run something else' but rather `run something else, but first
fiddle with descriptors using the following code'.
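
For anyone who wants to try the fragment, here is one way it might be
wrapped into a complete function.  This is a sketch under assumptions:
the name run_with_stdin, the use of `environ' for the environment, the
exit code 127 for a failed exec, and the waitpid() in the parent are
all additions made for this example.

```c
/* Asks glibc to declare vfork even under strict -std= modes. */
#define _DEFAULT_SOURCE 1
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * run_with_stdin (hypothetical name): run `path' with arguments `av'
 * and with descriptor `fd' as its standard input.  Returns the
 * program's exit status, or -1 on failure.
 */
int run_with_stdin(const char *path, char *const av[], int fd)
{
	int status;
	pid_t pid = vfork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* `run something else, but first fiddle with descriptors' */
		if (fd != 0) {	/* closing fd 0 itself would lose stdin */
			(void) dup2(fd, 0);
			(void) close(fd);
		}
		(void) execve(path, av, environ);
		_exit(127);	/* reached only if the exec failed */
	}
	/* the descriptor table is kernel data, so ours is unchanged */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child does nothing except adjust descriptors and exec, which is
the `spawn, but let me run a few instructions first' reading.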
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
	(New campus phone system, active sometime soon: +1 301 405 2750)