Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!rutgers!labrea!decwrl!spar!hunt
From: hunt@spar.SPAR.SLB.COM (Neil Hunt)
Newsgroups: comp.unix.wizards
Subject: Re: Recovery from swap failure
Message-ID: <691@spar.SPAR.SLB.COM>
Date: Wed, 2-Sep-87 16:45:33 EDT
Article-I.D.: spar.691
Posted: Wed Sep 2 16:45:33 1987
Date-Received: Sat, 5-Sep-87 07:12:21 EDT
References: <2433@vdsvax.steinmetz.UUCP>
Reply-To: hunt@spar.UUCP (Neil Hunt)
Organization: Schlumberger Palo Alto Research - CASLAB
Lines: 25

In article <2433@vdsvax.steinmetz.UUCP> barnett@steinmetz.UUCP (Bruce G Barnett) writes:
>
>	We have a machine (Sun 3/260) dedicated to simulations (HILO).
>Someone submits a large job that will take several hours and it
>allocated quite a bit of virtual memory. Now someone submits a small
>job on the same machine, which allocates some more memory.
>
>	Now the large job runs out of swap space, and aborts after
>running for 20 hours. People scream, etc.
>
>	I could write a queuing daemon, but people don't want to wait
>for a 24 hour job to complete until their 20 minute job starts.
>

Here is a suggestion:

Assuming that it is [mc]alloc which is failing when you are out
of swap space, try writing an alternative version of malloc which
doesn't return NULL when memory cannot be allocated, but which instead
prints a warning on the console and suspends the process.

A human operator or a daemon could then wait until the system
became less loaded and restart the stopped process, which would
retry the allocation, succeed, and proceed as if nothing had happened.

Neil/.
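
The malloc replacement suggested above could be sketched roughly as
follows. This is only an illustration under some assumptions: the name
patient_malloc is made up for this example, the warning goes to stderr
rather than to the console device, and the process stops itself with
SIGSTOP so that an operator or daemon can later resume it with
"kill -CONT <pid>". It also assumes that the system reports the
shortage by having malloc return NULL.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * patient_malloc (hypothetical name): like malloc, but never returns
 * NULL for a non-zero request.  When the underlying malloc fails, it
 * prints a warning, stops the calling process, and retries the
 * allocation each time the process is continued with SIGCONT.
 */
void *patient_malloc(size_t size)
{
    void *p;

    /* A zero-byte request may legitimately yield NULL; don't loop on it. */
    while ((p = malloc(size)) == NULL && size != 0) {
        fprintf(stderr,
                "pid %ld: cannot allocate %lu bytes; stopping until SIGCONT\n",
                (long)getpid(), (unsigned long)size);
        fflush(stderr);

        /* Suspend ourselves.  Execution resumes here after SIGCONT. */
        raise(SIGSTOP);
    }
    return p;
}
```

The large job would call patient_malloc wherever it now calls malloc.
A calloc equivalent would need the same retry loop plus zeroing of the
returned block.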