Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!cs.utexas.edu!sdd.hp.com!elroy.jpl.nasa.gov!ames!dftsrv!mimsy!chris
From: chris@mimsy.umd.edu (Chris Torek)
Newsgroups: comp.arch
Subject: fork, spawn, vfork: an overview
Message-ID: <25904@mimsy.umd.edu>
Date: 5 Aug 90 23:39:07 GMT
References: <920@dgis.dtic.dla.mil> <5830@titcce.cc.titech.ac.jp> <1990Jul24.194313.3258@esegue.segue.boston.ma.us>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 127

Since there seems to be a great deal of confusion and ignorance
surrounding the `vfork issue', it might help if I describe how these
things work.

The basic idea behind `fork' is quite simple: make an exact duplicate
of the running process---copy all the user information and all the
kernel information, except of course the process ID and special
internal-only implementation data---and then return from the fork in
the old process giving the new process's ID, and return from the fork
in the new process giving a 0 (so that it can tell it is the clone
rather than the original).

This operation tends to be expensive since it involves copying data,
potentially a great deal of data.  There are a number of solutions:

 a. Avoid fork entirely (the spawn issue): `If the copy is merely
    being made so that the clone can throw away its user data and run
    another program, why bother copying?'  This is a sensible idea,
    but every time someone tries to do it, the spawn call winds up
    being terribly complex.  Perhaps this is because the wrong people
    do it; but perhaps it is because the spawn approach is inherently
    flawed.  I am not going to argue one way or the other here.

 b. Try not to copy.  Anything that can never be changed (e.g., the
    program's code or `text' section) obviously need not be physically
    copied.  (Unix fork() calls have always shared text.)  In
    addition, if the hardware supports it (e.g., via copy-on-write or
    copy-on-reference), the data and stack segments also need not be
    copied.
    This is not always the best policy; sometimes the expense of
    copying on write or reference outweighs the savings from never
    copying some pages.  There are some tricks that can be applied
    here as well (e.g., always copy the current stack page, leave the
    rest copy-on-write).

 c. Cheat.  This is the `vfork' approach.

Even if fork can be done via copy-on-write, on some machines it
remains expensive for large processes, because the page tables must be
copied anyway and these can also be large.  (Some of the pages of PTEs
can be shared, but this rapidly gets tricky.)  And of course, there
are machines on which copy-on-write cannot be done, either because the
hardware never supported it (e.g., where writes cannot be restarted)
or because of a bug in the microcode (e.g., the infamous VAX-11/750
bug); here copy-on-reference can be used, but it tends to be more
expensive (most processes read many more locations than they write).

Thus, the `virtual fork' trick.  Suppose that, instead of making a
copy of the data and stack segments *and* the kernel per-process
information, the kernel were to make a copy of just the kernel
per-process information, then suspend the parent process entirely,
`handing over' the entire virtual memory image to the clone (by giving
the original's PTE pages as well as its data and stack segments to the
clone)?  The clone would then be running in the very same virtual (and
physical) memory that the original had before it `virtually forked'.
Nothing has been copied except the kernel data.

The main issue here is that, if the original process were allowed to
run, both the parent and the child would be competing for the same
data and stack memory---much like threads, but disastrous since the
stacks would rapidly collide---so the parent must be marked as having
no memory at all, and cannot run while the child uses the space.
The child, too, must be marked specially, so that when it is ready to
give up its virtual space (via exec() or exit()), the kernel knows
instead to `hand back' the entire virtual memory to the parent.

This is how vfork works.  Only the kernel data (the u. area) are
copied.  The old virtual memory is not so much `shared' as `handed
over' to the child: it uses the exact same resources the parent
process had.  When the child process is done with them, they are
`handed back' (using the same kernel routine, in fact; it is called
`vpassvm') and the parent is allowed to continue.

There are several problems with this scheme:

 - Since the child process (the clone) has already returned from the
   vfork() system call, the stack frame needed for this return has
   been wiped out.  This is handled by avoiding the return in the
   first place: The vfork call is written in assembly, and pops the
   stack frame before doing the system call at all.  On return from
   the system call the assembly code jumps to the place a return
   would have gone, having pre-set the registers as appropriate.

 - Since the old VM is handed over to the child, any changes the
   child makes show up in the parent as well.  Occasionally some
   software winds up depending on this (e.g., csh).

 - Since the parent is suspended, the order of execution (child runs
   to exec() or exit()) is defined.  Occasionally some software winds
   up depending on this as well (again, csh is the prime example).

 - Although the kernel data are copied (so that changes made in the
   child to, e.g., the signal dispositions are not reflected in the
   parent), some of these describe the virtual memory resources.
   When the child hands the virtual memory back to the parent, the
   parent process must account for any changes the child has made to
   the virtual memory itself---for instance, if the child has made
   the heap or stack larger, the parent must update its size
   information.  This is a good place for bugs to hide.
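
For contrast with the second problem in the list, here is a minimal
sketch of what a true fork guarantees, assuming a POSIX system.  The
helper name parent_value_after_fork is hypothetical and is not part of
the original post.  Under fork the child's write goes to its own copy
of the stack, so the parent still sees the old value; with the vfork
hand-over described above, the same write would land in the parent's
memory instead.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original post): fork, let the
 * child overwrite a local variable, and return the value the parent
 * sees afterwards.  Returns -1 if fork() or waitpid() fails or the
 * child does not exit cleanly.
 */
int parent_value_after_fork(void)
{
    volatile int value = 1;     /* lives in the stack segment */
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: this write lands in the child's private copy. */
        value = 2;
        _exit(value == 2 ? 0 : 1);
    }
    /* Parent: wait for the child, then inspect our own copy. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return value;               /* unchanged by the child under fork */
}
```

A caller would simply compare the result of parent_value_after_fork()
with 1.  The sketch deliberately uses fork rather than vfork: writing
to the parent's variables from a vfork child is exactly the kind of
dependence the list warns about, and portable code should not rely on
it.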

Buggy user programs aside, it is always possible for a kernel to
implement vfork as a true fork, if this is somehow cheaper.  Thus
there can never be an efficiency argument in favor of fork; vfork
never costs more.  Moreover, even given a copy-on-write fork, vfork
may still be cheaper, because vfork does not have to copy PTEs.  On a
machine like the Butterfly, a vfork helps because the parent process
is suspended: the child can run in the parent's space until it exits
or, more likely, gets another processor to do an exec.

In essence, vfork is `fork with a promise to finish soon' or even
`spawn, but let me do a few things before I go'.

When you come right down to it, the real disadvantage to vfork is
that, conceptually, it is downright ugly.  Viewing it as `spawn, but
let me run a few instructions first' helps a lot here.  The meaning of

	pid = vfork();
	if (pid == 0) {
		(void) dup2(fd, 0);
		(void) close(fd);
		(void) execve(path, av, env);
		_exit(1);
	}
	...

is not `make a copy; if this is the child, fiddle with descriptors and
run something else' but rather `run something else, but first fiddle
with descriptors using the following code'.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@cs.umd.edu    Path: uunet!mimsy!chris
    (New campus phone system, active sometime soon: +1 301 405 2750)