Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!pacbell!pbhya!whh
From: whh@pbhya.PacBell.COM (Wilson Heydt)
Newsgroups: news.sysadmin
Subject: Re: "Morris did it"--the new excuse?
Message-ID: <21673@pbhya.PacBell.COM>
Date: 19 Nov 88 18:18:20 GMT
References: <552@comdesign.CDI.COM> <1570@valhalla.ee.rochester.edu> <16915@agate.BERKELEY.EDU>
Organization: Pacific * Bell, Oakland, CA
Lines: 47

In article <16915@agate.BERKELEY.EDU>, weemba@garnet.berkeley.edu (Obnoxious Math Grad Student) writes:
> 
> When I've taught courses that use computers, I told students that under
> almost all circumstances, computer downtime would not be an excuse for
> lateness.  The one exception I've ever made involved granting everyone
> a week's extension.  I've never worked assuming that the machines I use
> are 100% reliable.  Do the scientists/researchers at your site do so--
> even on critical stuff?  If someone has a grant proposal riding on get-
> ting something done by a certain deadline, what happens if there's a
> major disk crash at your site?

Where I work, machines are known not to be 100% reliable--but we try to
come as close as we can.  The project I'm on has agreements with our 
operations group to provide for 98% up-time during scheduled hours.
The last status report I got had a warning on it because it was only
98.04%.  Generally they do much better--many months it has been 100%.
If the application isn't available I usually start getting calls within
a few *minutes*--and the worm caused outages of hours to *days*.  I don't
have any direct operations responsibility--but I'd be answering a lot of
questions if anything near that severe happened.

When there is a major disk-crash, backups of the data-sets are loaded
to a work pack (decreasing the available work space, temporarily) and
the application--or system, depending on which pack--is brought back up.
This should not take more than 30 minutes.

The application I work on is small--we have only about 150 users and I
*personally* wrote about 70% of the code in it (there's no one else to
blame!).  You ought to see the care given to *important* systems! What
we're trying to achieve is the reliability of our major customer system.
(Hint--when was the last time your 'phone failed to work?  Was it the
handset or the system?)

> Hospitals generally have a backup power supply.  For a very good reason.

The system I have at home has a UPS on it--I consider it cheap insurance.

>             But now you cite computers where users cannot afford to have
> computers to be down for long--do the sites that run them without having
> any contingency plans whatsoever?  Such sites are irresponsible.

We have contingency plans.  Every company I've ever worked for has had
disaster planning.  I've been through two actual computer "disasters".
One was a "flood" (on the 13th floor) and the other created an actual
risk of explosion in the computer room.

   --Hal