Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!clyde.concordia.ca!nstn.ns.ca!news.cs.indiana.edu!att!linac!uwm.edu!rpi!zaphod.mps.ohio-state.edu!think.com!hsdndev!cmcl2!lanl!jlg
From: jlg@lanl.gov (Jim Giles)
Newsgroups: comp.arch
Subject: Re: Ignorance speaks loudest (was:Computers for users not programmers)
Message-ID: <14696@lanl.gov>
Date: 14 Feb 91 21:39:17 GMT
References: <1991Feb14.195906.5726@news.arc.nasa.gov>
Organization: Los Alamos Natl Lab, Los Alamos, N.M.
Lines: 30

From article <1991Feb14.195906.5726@news.arc.nasa.gov>, by lamaster@pioneer.arc.nasa.gov (Hugh LaMaster):
> [...]
> It is an open question as to whether what used to be called "recovery of
> rolled jobs", "user checkpointing", etc. really makes sense anymore. It
> was a good idea when the MTBF of a CPU was four hours, and we had on-site
> C.E.'s to fix the hardware in a hurry. With MTBF's of *months* on many
> systems, I'm not sure it is a good idea. How many people frequently have
> a long running job die in a hardware related crash now? [...]

Admittedly, this is not very frequent. However, the machine may go down
for _scheduled_ maintenance or dedicated time on a daily basis. Automatic
crash recovery is an important feature of the system: it allows the
systems people to bring the machine down at will without losing anything
belonging to the users (except time).

> [...] How many of those
> jobs hadn't modified any files yet? [...]

On our system, NONE of the dropfiles are out of sync with their I/O. The
dropfile is the system's swap image - if it _could_ even be out of sync
with the files open to the process, not even the system would be able to
use it safely. That's the advantage of dropfiles: they make a feature
that the system already provides directly available to the user. They
are updated every time the program swaps. They can be left behind as
restartable images after the program fails - just patch the cause of the
failure and restart it.
They don't require extra disk space, since the system has to allocate
space for a swap image anyway.

J. Giles