Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!steinmetz!vdsvax!barnett
From: barnett@vdsvax.steinmetz.UUCP (Bruce G Barnett)
Newsgroups: comp.unix.wizards
Subject: Re: Recovery from swap failure
Message-ID: <2461@vdsvax.steinmetz.UUCP>
Date: Fri, 4-Sep-87 06:30:08 EDT
Article-I.D.: vdsvax.2461
Posted: Fri Sep 4 06:30:08 1987
Date-Received: Sat, 5-Sep-87 20:00:26 EDT
References: <2433@vdsvax.steinmetz.UUCP> <691@spar.SPAR.SLB.COM>
Reply-To: barnett@steinmetz.UUCP (Bruce G Barnett)
Organization: General Electric CRD, Schenectady, NY
Lines: 41

Re: my recovery from swap failure.

I have enjoyed the few suggestions I have gotten. But I believe
that there is no solution to the situation I proposed.

Remember - this is with a vendor's simulation program, so I can't
hack the sources. (I will complain to the vendor about check-pointing.)

Even if I could hack the sources, however, there is still a problem
of recovery from a swap failure. To wit:

	Swap partition = 100 Meg
	Job A runs for 20 hours - allocates (say) 80 Meg
	. . .
	Job B (but same program as A) starts up, allocates 19 Meg
	. . .
	Job A needs 2 Meg more virtual memory - fails - aborts - riots start

Without check-pointing, it does no good for Job A to suspend.
Job B will continue, suspend, and then Job C will start, suspend, etc.

Perhaps the software could detect a malloc failure and, given some
parameter specified by the user, suspend or abort the job
(small jobs abort, big jobs suspend - or oldest job suspends,
newest job aborts).

As it turns out - we have a viable solution - multiple simulation machines!

I will most likely implement:

	All simulation jobs go into a queue
	Big jobs go to the large machine
	Small jobs go to the large machine if it is idle
	Otherwise, they go to the small system(s).

Someone here has MDQS, which I will look into.

Any (additional) ideas or suggestions will be appreciated.
-- 
	Bruce G. Barnett
	uunet!steinmetz!barnett
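
The swap arithmetic and the job-routing rules above can be sketched roughly as follows. This is only an illustration, not anything MDQS or the vendor's program provides: the names swap_request_fits and route_job, the "large"/"small" labels, and the 50 Meg cutoff between small and big jobs are all assumptions made up for the example.

```python
SWAP_MEG = 100      # swap partition size from the example above
BIG_JOB_MEG = 50    # assumed cutoff between "small" and "big" jobs (hypothetical)


def swap_request_fits(allocated_meg, request_meg, swap_meg=SWAP_MEG):
    """Return True if a new request still fits in the swap partition."""
    return sum(allocated_meg) + request_meg <= swap_meg


def route_job(expected_meg, large_machine_idle, big_job_meg=BIG_JOB_MEG):
    """Pick a machine class for a queued job using the rules listed above."""
    if expected_meg >= big_job_meg:
        return "large"      # big jobs always wait for the large machine
    if large_machine_idle:
        return "large"      # small jobs may borrow the large machine when idle
    return "small"          # otherwise small jobs run on the small system(s)


if __name__ == "__main__":
    # Job A holds 80 Meg, Job B holds 19 Meg, Job A asks for 2 Meg more.
    print(swap_request_fits([80, 19], 2))
    print(route_job(80, large_machine_idle=False))
    print(route_job(19, large_machine_idle=True))
    print(route_job(19, large_machine_idle=False))
```

With the numbers from the example, the 2 Meg request does not fit (80 + 19 + 2 = 101 Meg against a 100 Meg partition), which is the failure the queue is meant to avoid by keeping the 80 Meg job and the 19 Meg job on separate machines whenever the large one is busy.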