Xref: utzoo comp.realtime:739 comp.os.os9:1070
Path: utzoo!attcan!uunet!mcsun!cernvax!herbert
From: herbert@cernvax.UUCP (herbert walseth)
Newsgroups: comp.realtime,comp.os.os9
Subject: Re: Time Problem on OS-9 Systems
Message-ID: <2091@cernvax.UUCP>
Date: 13 Jul 90 13:53:03 GMT
References: <13391@shlump.nac.dec.com>
Followup-To: comp.realtime
Organization: CERN, European Laboratory for Particle Physics
Lines: 61

In article <13391@shlump.nac.dec.com> mcculley@alien.enet.dec.com writes:
>
>In article <1990Jul5.190532.17243@cbnewsd.att.com>, knudsen@cbnewsd.att.com
>(michael.j.knudsen) writes...
>> 
>>A bug that bites only after 4 months of continuous operation without a reboot?
>> [...]
>>Says something about OS9/K that such a problem should ever come up.
>>Also about the hardware used (except for the clock bug).
>>I wonder how many other OSes stay up that long...?
>>-- 
>
> [...]
>
>I *-> expect <-* production systems (hardware and software) to be capable of
>staying up indefinitely, unless I do something to cause them to be otherwise.
>
>Why would anyone expect otherwise?
>

IMHO, both yes and no.

Hardware does have a limited lifetime, and that is something we have to
live with.  In particular, hard disks and other devices with moving parts
are bound to break down sooner or later.

Some might say that an MTBF of 40,000 hours is as good as infinity.  But if
you have 100 systems running continuously, that works out to one failure
about every 400 hours, so you must be ready to replace a disk roughly
every fortnight.
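
The arithmetic behind that figure can be checked with a few lines of
Python.  This is only a back-of-envelope sketch: it assumes one disk per
system, independent failures and a constant failure rate of 1/MTBF, and
the function name is made up for the illustration.

```python
# Back-of-envelope check of the fleet-wide disk failure interval.
# Assumptions (mine, for illustration): one disk per system, independent
# failures, constant failure rate of 1 / MTBF per disk.
HOURS_PER_DAY = 24


def days_between_failures(mtbf_hours: float, n_systems: int) -> float:
    """Expected days between failures across n_systems identical units."""
    fleet_failures_per_hour = n_systems / mtbf_hours
    return 1.0 / fleet_failures_per_hour / HOURS_PER_DAY


if __name__ == "__main__":
    days = days_between_failures(40_000, 100)
    print(f"about one failure every {days:.1f} days")
```

With 40,000 hours and 100 systems this comes out at a little over 16
days, which is close enough to a fortnight for planning purposes.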

On the other hand, I do not _expect_ software to have a _limited lifetime_.
In particular, real-time software, both OS and applications, that is
supposed to run continuously should be able to do so.  There should be no
slowly incrementing counter, slow memory leak or other time bomb hidden
in the system that sooner or later will cause it to crash.

This is not acceptable no matter how unlikely the software designer thinks
it is that the limit will ever be reached.  Sooner or later, someone is
going to reach it.  And with real-time systems, the result might be
disastrous.
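
To give a feeling for how far away such a limit can be, here is a small
sketch that computes how long a free-running tick counter takes to wrap
around.  The counter widths and tick rates are hypothetical values picked
for illustration; they are not taken from OS-9 or from any particular
driver.

```python
# How long until a free-running tick counter wraps around?
# The widths and tick rates below are hypothetical examples only.
SECONDS_PER_DAY = 86_400


def days_until_wrap(bits: int, ticks_per_second: float) -> float:
    """Days of continuous uptime before a counter of `bits` bits overflows."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    for bits, hz in [(31, 100), (32, 100), (32, 1000)]:
        days = days_until_wrap(bits, hz)
        print(f"{bits}-bit counter at {hz} ticks/s wraps after {days:.1f} days")
```

The point is that all of these limits lie well beyond a weekend of lab
testing but well within the uptime of a production system.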

It is of course very hard to verify that a system is free of time bombs.
(Impossible, one might say.) By pushing the system hard in the lab, months
of normal operation may be simulated over a weekend or so.  But not
everything is equally easy to simulate.  This thread was started by a
posting where I asked for help with a clock problem <2031@cernvax.UUCP>.
It turned out that due to a bug in the clock driver, the clock would
start to slow down after around four months of continuous operation.  It
is not obvious to me how a user can protect himself against problems like
this, and this particular bug taught me a lesson.
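
One partial defence, for code one controls oneself, is to start counters
close to their limit during testing so that the wraparound happens within
minutes instead of months.  The sketch below is an illustration of that
idea only; the names (TickCounter, elapsed_ticks, naive_elapsed) and the
32-bit width are assumptions of mine and have nothing to do with the
actual OS-9 clock driver.

```python
# Testing wraparound by preloading a counter near its limit.
# All names and the 32-bit width are hypothetical, for illustration only.
COUNTER_BITS = 32
MASK = (1 << COUNTER_BITS) - 1


def naive_elapsed(start: int, now: int) -> int:
    """Plain subtraction: gives a wrong (negative) result after a wrap."""
    return now - start


def elapsed_ticks(start: int, now: int) -> int:
    """Modular subtraction: correct as long as at most one wrap occurred."""
    return (now - start) & MASK


class TickCounter:
    """A software model of a fixed-width, free-running tick counter."""

    def __init__(self, initial: int = 0) -> None:
        self.value = initial & MASK

    def tick(self, n: int = 1) -> int:
        """Advance the counter by n ticks, wrapping at the fixed width."""
        self.value = (self.value + n) & MASK
        return self.value


if __name__ == "__main__":
    counter = TickCounter(initial=MASK - 4)  # five ticks before the wrap
    start = counter.value
    counter.tick(10)                         # crosses the wrap
    print("naive:    ", naive_elapsed(start, counter.value))
    print("wrap-safe:", elapsed_ticks(start, counter.value))
```

This only helps with counters one knows about and can preload, so it
would not by itself have caught a bug buried in a vendor's driver.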


PS. Sorry if I post this twice; something funny happened the first time
I tried. (And the system has only been up for two days :-).)

+----------------------------+-------------------------------------------+
|                            |                                           |
|  Herbert Walseth           |   No problem is so big or so complicated  |
|  herbert@cernvax.cern.ch   |   that it can't be run away from.         |
|                            |                                           |
+----------------------------+-------------------------------------------+