Xref: utzoo comp.realtime:739 comp.os.os9:1070
Path: utzoo!attcan!uunet!mcsun!cernvax!herbert
From: herbert@cernvax.UUCP (herbert walseth)
Newsgroups: comp.realtime,comp.os.os9
Subject: Re: Time Problem on OS-9 Systems
Message-ID: <2091@cernvax.UUCP>
Date: 13 Jul 90 13:53:03 GMT
References: <13391@shlump.nac.dec.com>
Followup-To: comp.realtime
Organization: CERN, European Laboratory for Particle Physics
Lines: 61

In article <13391@shlump.nac.dec.com> mcculley@alien.enet.dec.com writes:
>
>In article <1990Jul5.190532.17243@cbnewsd.att.com>, knudsen@cbnewsd.att.com (michael.j.knudsen) writes...
>>
>>A bug that bites only after 4 months of continuous operation without a reboot?
>> [...]
>>Says something about OS9/K that such a problem should ever come up.
>>Also about the hardware used (except for the clock bug).
>>I wonder how many other OSes stay up that long...?
>>--
>
> [...]
>
>I *-> expect <-* production systems (hardware and software) to be capable of
>staying up indefinitely, unless I do something to cause them to be otherwise.
>
>Why would anyone expect otherwise?
>

IMHO, both yes and no.

Hardware does have a limited lifetime, and that is something we have to
live with. In particular, hard disks and other devices with moving parts
are bound to break down sooner or later. Some might say that an MTBF of
40,000 hours is as good as infinity. But if you have 100 systems running
continuously, you should expect a disk failure about every 400 hours
(40,000 / 100), so you must be ready to replace a disk roughly every
fortnight.

On the other hand, I do not _expect_ software to have a _limited
lifetime_. In particular, real-time software, both OS and applications,
that is supposed to run continuously should be able to do so. There
should not be a slowly incrementing counter, a slow memory leak, or any
other time bomb hidden in the system that sooner or later will cause it
to crash. This is not acceptable, no matter how unlikely the software
designer thinks it is that the limit will ever be reached. Sooner or
later, someone is going to reach it.
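
To put rough numbers on the "slowly incrementing counter" case, here is
a minimal sketch of how long a free-running tick counter lasts before it
wraps. The function name, the counter widths and the tick rates are all
assumptions chosen for illustration; none of them are taken from OS-9 or
from any particular clock driver.

```python
# Illustrative only: the counter widths and tick rates are assumptions,
# not values taken from OS-9 or from any particular clock driver.

SECONDS_PER_DAY = 24 * 60 * 60


def days_until_wrap(bits, ticks_per_second):
    """Days of uptime before an unsigned counter of `bits` bits wraps to zero."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # A few hypothetical width/rate combinations.
    for bits, rate in [(16, 100), (31, 100), (32, 100), (32, 1000)]:
        days = days_until_wrap(bits, rate)
        print(f"{bits}-bit counter at {rate} Hz wraps after {days:.2f} days")
```

Under these assumptions a 32-bit counter lasts about 497 days at 100 Hz
and about 50 days at 1000 Hz, both far longer than a weekend in the lab.
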
And with real-time systems, the result might be disastrous.

It is of course very hard to verify that a system is free of time bombs.
(Impossible, one might say.) By pushing the system hard in the lab,
months of normal operation may be simulated over a weekend or so. But
not everything is equally easy to simulate.

This thread was started by a posting where I asked for help with a clock
problem <2031@cernvax.UUCP>. It turned out that, due to a bug in the
clock driver, the clock would start to slow down after around four
months of continuous operation. It is not obvious to me how a user can
protect himself against problems like this, and this particular bug
taught me a lesson.

PS. Sorry if I post this twice, something funny happened the first time
I tried. (And the system has only been up for two days :-).)

+----------------------------+-------------------------------------------+
|                            |                                           |
| Herbert Walseth            | No problem is so big or so complicated    |
| herbert@cernvax.cern.ch    | that it can't be run away from.           |
|                            |                                           |
+----------------------------+-------------------------------------------+