Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!rutgers!labrea!decwrl!spar!hunt
From: hunt@spar.SPAR.SLB.COM (Neil Hunt)
Newsgroups: comp.unix.wizards
Subject: Re: Recovery from swap failure
Message-ID: <691@spar.SPAR.SLB.COM>
Date: Wed, 2-Sep-87 16:45:33 EDT
Article-I.D.: spar.691
Posted: Wed Sep  2 16:45:33 1987
Date-Received: Sat, 5-Sep-87 07:12:21 EDT
References: <2433@vdsvax.steinmetz.UUCP>
Reply-To: hunt@spar.UUCP (Neil Hunt)
Organization: Schlumberger Palo Alto Research - CASLAB
Lines: 25

In article <2433@vdsvax.steinmetz.UUCP> barnett@steinmetz.UUCP (Bruce G Barnett) writes:
>
>	We have a machine (Sun 3/260) dedicated to simulations (HILO).
>Someone submits a large job that will take several hours and it
>allocated quite a bit of virtual memory. Now someone submits a small
>job on the same machine, which allocates some more memory. 
>
>	Now the large job runs out of swap space, and aborts after
>running for 20 hours. People scream, etc.
>
>	I could write a queuing daemon, but people don't want to wait
>for a 24 hour job to complete until their 20 minute job starts.
>

Here is a suggestion:

Assuming that it is [mc]alloc which is failing when you run out of swap space,
try writing an alternative version of malloc which, instead of returning NULL
when memory cannot be allocated, prints a warning on the console
and suspends the process. A human operator or a daemon could then
wait until the system becomes less loaded and restart the stopped
process, which would then succeed in allocating memory and proceed as if
nothing had happened.
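
Something along the following lines might do as a starting point. It is
only a rough sketch, not tested against HILO or under real swap exhaustion:
the name patient_malloc is made up for illustration, the warning goes to
stderr rather than the console proper, and it assumes a system with job
control signals (SIGSTOP/SIGCONT) and a malloc that really does return NULL
when swap is exhausted.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * patient_malloc: hypothetical replacement for malloc (the name is an
 * assumption, not an existing library routine). Instead of returning
 * NULL on failure, it prints a warning and stops the calling process.
 * When someone later sends SIGCONT, the allocation is retried.
 */
void *patient_malloc(size_t size)
{
    void *p;

    while ((p = malloc(size)) == NULL) {
        /* Assumption: stderr stands in for the system console here. */
        fprintf(stderr,
                "pid %ld: cannot allocate %lu bytes; stopping until SIGCONT\n",
                (long)getpid(), (unsigned long)size);
        fflush(stderr);

        /* SIGSTOP cannot be caught or ignored, so the process stops here. */
        raise(SIGSTOP);

        /* Execution resumes at this point after SIGCONT; loop and retry. */
    }
    return p;
}
```

The simulation would have to be relinked so that its allocation calls go
through this routine (calloc would need a similar wrapper). Once the small
job has finished, the operator or daemon can resume the stopped process
with "kill -CONT <pid>", and the loop retries the request.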

Neil/.