Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!husc6!rutgers!bellcore!texbell!sneaky!gordon
From: gordon@sneaky.TANDY.COM (Gordon Burditt)
Newsgroups: comp.unix.xenix
Subject: Re: SCO Xenix System Hang
Message-ID: <5517@sneaky.TANDY.COM>
Date: 19 Dec 88 04:33:17 GMT
References: <766@wasatch.UUCP> <329@bilver.UUCP> <832@ramz.UUCP>
Reply-To: gordon@sneaky.UUCP (Gordon Burditt)
Organization: Gordon Burditt
Lines: 79

The described "hang" (the system runs very slowly, but having different
users use different logins fixes or reduces the problem) is caused by
the per-user-id process limit. There are several features that
contribute to this:

1. There is a limit to the number of processes a non-root user may have
running at one time, called MAXUPRC. The default is probably something
like 15. If you can re-link the kernel, you can probably raise this
limit. (I am working from an old version (non-*86) of SCO Xenix System
III, so some of this may have changed.) Raising this limit does not
increase the size of any tables. If you need lots of processes (and
this problem will exist regardless of what uids they run under), you
may need to increase the number of entries for processes, open files,
and inodes. If you run out of open files or inodes, you get cryptic
console messages like "no file". If you run out of system processes,
you may get error messages (csh), or just lots of retrying (sh).

Fix: don't run everything under the same uid, and/or raise MAXUPRC.

2. If the "sh" shell gets an error on a "fork" due to running over the
MAXUPRC limit, it retries. Forever, unless interrupted by a signal.
For a quick test of this, log in, then type "sh" repeatedly. After
about 15 or 20 times, you won't get another prompt. Use your interrupt
character to unlock the terminal. Then type lots of control-D's to get
rid of all those shells.

Now, imagine three users logged in under the same user id. Each has 5
processes (15 in all, which is the limit if the default is what I think
it is), and is trying to create a 6th with sh.
None of them will get any work done until one of them aborts the
retries and terminates that shell. Whether or not the interrupt
character can do this, and whether trying will destroy data, depends on
the application. The scenario this retrying is supposed to handle is
waiting for another, independent job started from the same terminal to
release its processes (say, a mailer doing background delivery) without
requiring more.

Fix: Try to arrange jobs to not require deep nesting of processes.

3. To further complicate the situation, some applications don't wait
for their children. A process occupies a process slot until its parent
(or if its parent dies, its foster parent, process 1) waits for it. If
an application keeps spinning off background print jobs, and never
waits for them to finish, eventually it will hit MAXUPRC or run the
system out of processes. These will show up on a ps as zombie
processes with parents other than process 1. A compromise for this
might be to allow one outstanding background job, and after spinning
off the second one, wait for one of them to finish. Also, doing a
wait() with an alarm() set can pick up already-terminated processes
without waiting for all of them.

Fix: applications should wait for their children. (Also check the
status and report problems!)

4. This one gets a little exotic, and may be specific to the system I
am using. It doesn't apply to systems that do paging instead of
swapping. It also isn't related to running everything under one user
id.

There is a limit to the maximum amount of memory a process can use at
one time in the kernel. Suppose that the kernel is re-configured to
raise this limit to above the amount of memory available. (Limit >
physical memory - kernel memory. "Available memory" means the maximum
amount of memory a process can get without hanging the system, after
administrative restrictions are raised.) Now, have an application
program request 110% of available memory. This request will fail.
Have an application request 100% of available memory plus one
allocation unit. This request doesn't fail (but it should). The
process gets swapped out and tries to swap back in. In the process,
the swapper swaps everything else out. You can't kill the huge process
because it needs to swap in to die. Something else running may lock up
behind this process, or it may run, but slowly because it keeps getting
swapped out.

The fix for this is to not let processes get away with requesting that
much memory. The easiest way is to lower the "administrative limit"
maxprocmem. This may not be present in System V, or it may exist in
another form.

					Gordon L. Burditt
					...!texbell!sneaky!gordon
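
To illustrate point 3 (applications should wait for their children),
here is a minimal sketch in C. It is not taken from any real
application: spawn_jobs(), run_job() and reap() are made-up names, and
run_job() is only a stand-in that exits with a status. It assumes a
POSIX system with waitpid(); WNOHANG is swapped in for the
wait()-under-alarm() trick, since it also picks up already-terminated
children without blocking. Older kernels may not have it.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a background print job: job i exits with
 * status 1 when i is odd, so the parent has some failures to report.
 * Never returns. */
static void run_job(int i)
{
    _exit(i % 2);
}

/* Collect one child.  flags is 0 (block) or WNOHANG (return at once).
 * Returns the child's pid, or a value <= 0 if nothing was collected.
 * A child that did not exit cleanly is reported and counted. */
static pid_t reap(int flags, int *failed)
{
    int status;
    pid_t pid = waitpid(-1, &status, flags);

    if (pid > 0 && (!WIFEXITED(status) || WEXITSTATUS(status) != 0)) {
        fprintf(stderr, "job %ld failed, status 0x%x\n",
                (long)pid, (unsigned)status);
        (*failed)++;
    }
    return pid;
}

/* Start njobs children, never leaving more than max_outstanding of
 * them uncollected.  Returns the number of failed jobs, or -1 if
 * fork() or a blocking wait fails. */
int spawn_jobs(int njobs, int max_outstanding)
{
    int outstanding = 0, failed = 0, error = 0;

    assert(max_outstanding > 0);
    for (int i = 0; i < njobs && !error; i++) {
        /* Pick up anything already finished, without blocking. */
        while (outstanding > 0 && reap(WNOHANG, &failed) > 0)
            outstanding--;

        /* At the limit: block until one slot frees up. */
        while (outstanding >= max_outstanding && !error) {
            if (reap(0, &failed) > 0)
                outstanding--;
            else
                error = 1;
        }
        if (error)
            break;

        pid_t pid = fork();
        if (pid < 0)
            error = 1;          /* e.g. over the per-uid limit: give up,
                                   don't retry forever */
        else if (pid == 0)
            run_job(i);         /* child */
        else
            outstanding++;      /* parent */
    }

    /* Never leave zombies behind, even after an error. */
    while (outstanding > 0 && reap(0, &failed) > 0)
        outstanding--;

    return error ? -1 : failed;
}
```

spawn_jobs(4, 1) is the "one outstanding background job" compromise:
it runs four jobs, holds at most one uncollected child at a time, and
returns 2 because the stand-in jobs 1 and 3 exit nonzero. Giving up
when fork() fails is a choice made for this sketch; an application
might prefer to wait for a child and try again a bounded number of
times.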