Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!usc!brutus.cs.uiuc.edu!psuvax1!psuvm!w0l
From: W0L@PSUVM.BITNET (Bill Lasher)
Newsgroups: comp.sys.sgi
Subject: Re: fsck
Message-ID: <89313.153421W0L@PSUVM.BITNET>
Date: 9 Nov 89 20:34:21 GMT
Organization: Penn State University
Lines: 56

Some of you may have been following the fsck question I posted last week.
Thanks to help from several of you, including some people at SGI, I finally
decided the REAL problem was our system administration.  One of the people at
SGI thought the following might be of interest to others, and suggested I post
it.

The original note follows:
=========================================================================
Date:    9 November 1989, 14:16:04 EST
From:    Bill Lasher                (814) 898-6391  W0L      at PSUVM
Subject: Re: fsck, init state 3
To:      dunlap at sgi.sgi.com
In-Reply-To:  dunlap%bigboote.csd AT sgi.com   -- Thu, 9 Nov 89 11:06:55 PST

Our most recent problem (the RPC timeout) I think was caused by the way
we implemented the nightly reboots.  We scheduled them 5 minutes apart,
figuring that would be enough time.  I found out today that one machine
was still in the process of restarting when the YP server it was
communicating with started to reboot.  This caused the system to hang.
Rebooting did in fact clear things up, but it took some time.  Part of
the problem is that the time on each machine is not exactly the same (a
difference of a couple of minutes).  We are going to set all machines to
the same time, and change the reboot interval to 10 minutes.

I think we got thrown off the track because running fsck nightly changed
the total time it took for the systems to reboot, and things just
happened to work out O.K.  Also, we probably weren't patient enough
earlier to let reboot do its thing; when reboot didn't work, we tried
fsck, which did work because it took longer to finish up, and by the
time it was done the network wasn't as busy (or something like that).  I
think we were also in a hurry to get things fixed, and as a result got
sloppy (i.e., running fsck without unmounting, etc.).

Some of our problems may come back, but we will handle each of them
separately as they occur, and try to be more careful.  I suspect some of
the earlier problems (the full disks, hung spool queues) showed up
because we were letting the systems run for a week at a time without
rebooting, and things just got a little messy.  We had planned from the
beginning to have them reboot every night, but we had too many other
things going on to get it implemented.

We'll just take it from here and see what happens.

Best regards,

Bill
========================================================================
END OF ORIGINAL NOTE

You may not follow all the details, but you probably get the general idea.
I think it's a good example of what can happen when an experienced computer
user gets his first UNIX/networked system.
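
For anyone who wants the timing argument from the note spelled out, here
is a rough sketch of the arithmetic.  It is only an illustration: the
function name is made up, and the 4-minute reboot time is an assumed
figure, not something stated in the note.  The 5- and 10-minute
intervals and the couple of minutes of clock difference are from the
note.

```python
def reboots_overlap(stagger_min, reboot_duration_min, clock_skew_min):
    """Return True if one machine may still be restarting when the next begins.

    Worst case: the two machines' clocks disagree by clock_skew_min in the
    unlucky direction, so the real gap between the two reboots is the
    scheduled stagger minus the skew.
    """
    effective_gap = stagger_min - clock_skew_min
    return reboot_duration_min > effective_gap


# ASSUMPTION: a reboot takes about 4 minutes (hypothetical figure).
ASSUMED_REBOOT_MIN = 4

# Original schedule: 5 minutes apart, clocks off by about 2 minutes.
old_schedule_collides = reboots_overlap(5, ASSUMED_REBOOT_MIN, 2)

# Planned schedule: 10 minutes apart, same skew.
new_schedule_collides = reboots_overlap(10, ASSUMED_REBOOT_MIN, 2)

# Synchronized clocks alone, keeping the 5-minute interval.
synced_only_collides = reboots_overlap(5, ASSUMED_REBOOT_MIN, 0)
```

Under that assumed reboot time, the original schedule can collide while
either fix alone would avoid it; doing both just leaves more margin for
a slow reboot.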

Bill

"If I knew what I was doing, I wouldn't have had to ask the question!"