Xref: utzoo comp.os.os9:1053 comp.realtime:725
Path: utzoo!attcan!uunet!midway!ncar!asuvax!cs.utexas.edu!usc!apple!snorkelwacker!bloom-beacon!eru!luth!sunic!mcsun!cernvax!herbert
From: herbert@cernvax.UUCP (herbert walseth)
Newsgroups: comp.os.os9,comp.realtime
Subject: Time Problem on OS-9 Systems
Summary: RTC slows down after long period of operation
Message-ID: <2031@cernvax.UUCP>
Date: 3 Jul 90 09:02:26 GMT
Followup-To: comp.os.os9
Organization: CERN, European Laboratory for Particle Physics
Lines: 112


Help!

After more than four months of uninterrupted operation, the real-time
clock on our OS-9 systems is suddenly causing us problems.  Below is an
attempt to describe the situation.  We are quite stuck with the problem
and would greatly appreciate any advice from the net.

I am cross-posting this to comp.realtime and comp.os.os9; both general
and OS-9-specific comments are welcome.

[Sorry about the length of the posting, but I'm trying to include
all relevant information and that is not easy when you don't know 
where the problem lies.]

The systems read the time from a real-time clock card when they boot. 
From then it is up to the OS-9 software to keep the time correct. 
This worked fine and the time was stable for months. But during the 
last two weeks, the time on three of the systems has suddenly started 
to slow down. They are all steadily losing approximately ten minutes per
day.

There might be several obvious reasons for this. We did have a similar
problem on one of our systems before and that was due to a hardware 
problem on the cpu card. But we find it most unlikely that three cards 
should break down almost at the same time, more than a year after 
the installation.

The malfunction might also be caused by some problem with the power 
supply; the area went through a period of heavy thunderstorms some 
time ago. But our systems are powered from LARGE batteries that should 
filter out all spikes from the mains. Sudden changes in temperature 
etc. are also quite unlikely 100 meters underground.

The three systems are standing several kilometers away from each other 
and are not directly connected in any way. They have been doing exactly 
the same tasks during this period and no operational changes that 
can explain the situation have been made.

We are not moving our VME crates around at high speeds.

After eliminating all more or less obvious faults that we could think 
of, we went to the less likely ones. The only possibility that we 
have left, and that we would be very interested in getting some feedback 
on, is that the long uninterrupted period of operation may have caused 
an overflow somewhere in the OS-9 software and that this might cause
the system to slow down.
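
As a rough sanity check on the overflow idea, the sketch below (Python,
arithmetic only) estimates how long a tick counter of a given width
would take to wrap. The rate of 123 ticks per second is an assumption,
inferred from the 0x144510 ticks per 3 hours quoted further down. The
counter widths are hypothetical examples and are not taken from the
OS-9 sources.

```python
# Rough wrap-around times for a free-running tick counter.
# Assumption: 123 ticks/s, i.e. 0x144510 ticks / (3 * 3600 s).
TICKS_PER_SECOND = 123
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_wrap(bits, ticks_per_second=TICKS_PER_SECOND):
    """Days until an unsigned counter of the given width wraps around."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


# Hypothetical widths: 24-bit, signed 32-bit (31 usable bits), unsigned 32-bit.
for bits in (24, 31, 32):
    print(f"{bits}-bit counter: wraps after about {days_until_wrap(bits):.1f} days")
```

Under these assumptions a 31-bit tick count would last roughly 200 days
and a full 32-bit count roughly 400 days, both longer than the four
months of uptime described here. A plain wrap of the tick count alone
would therefore not obviously explain the problem, although narrower or
derived counters are not ruled out.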

All systems have the same software installed. There are around 25 
processes running concurrently and although they are sleeping most 
of the time, some of them have accumulated 100+ cpu hours and made 
some hundred million calls to the kernel. There are also one or two 
short-lived processes forked every second. This should have accumulated 
to around 15-20 million forks by now. Our installed systems are running
OS-9 Version 2.2.

What really made us suspect the software is the following: During 
a shutdown of the accelerator we rebooted one of the systems, used 
setime to correct the time on the second and left the last one untouched. 
After this, the one that was rebooted started to run normally, while 
the two others continued to lose time.

We find it quite hard to debug this problem. Obviously we cannot plug 
in a logic analyser etc. since a reboot will cure (hide) the problem. 
But the systems have a floppy so we can install test programs. I have 
checked system global variables like ticks per second, ticks since 
reboot etc. and they all look OK. 

What really confused me is the following experiment: I made a 
small program that reads the ticks since reboot and seconds left until 
midnight. It then sleeps for 0x144510 ticks, or 3 hours, and calculates 
the number of ticks and seconds that have passed while it was asleep. 
On all systems, the expected 0x144510 ticks have passed, but the number 
of seconds that have passed according to the D_Seconds variable is 
different. The amazing thing is that on the systems where the clock 
is running incorrectly, the elapsed time is 0x2a30 seconds or 3 hours 
as it should be. On the systems where the clock is correct, however, 
0x2a82 seconds have passed!
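
The figures above can be checked with a few lines of arithmetic. The
sketch below (Python, used only as a calculator) converts the hex values
and scales the difference to a full day. The constant and function names
are made up for illustration; only the numbers come from the test
described above.

```python
# Arithmetic check of the sleep test: ticks slept versus seconds elapsed.
TICKS_SLEPT = 0x144510            # ticks the test program slept
NOMINAL_SECONDS = 3 * 60 * 60     # 3 hours
SECONDS_SLOW_SYSTEMS = 0x2A30     # elapsed seconds where the clock loses time
SECONDS_GOOD_SYSTEMS = 0x2A82     # elapsed seconds where the clock is correct
SECONDS_PER_DAY = 24 * 60 * 60


def excess_seconds_per_day(observed_seconds, nominal_seconds=NOMINAL_SECONDS):
    """Scale the seconds counted beyond the nominal interval to one day."""
    extra = observed_seconds - nominal_seconds
    return extra * SECONDS_PER_DAY / nominal_seconds


print("ticks per nominal second:", TICKS_SLEPT / NOMINAL_SECONDS)
print("slow systems, excess s/day:", excess_seconds_per_day(SECONDS_SLOW_SYSTEMS))
print("good systems, excess s/day:", excess_seconds_per_day(SECONDS_GOOD_SYSTEMS))
```

0x144510 ticks over 3 hours corresponds to exactly 123 ticks per second.
The 82 extra seconds counted on the good systems scale to 656 seconds,
or just under eleven minutes, per day, which is close to the ten minutes
per day that the other systems are losing.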

Here is a description of the hardware:
The cpu board has a 68020 cpu, 68881 fpcp, 1 MByte of SRAM and 512 
KByte of EPROM. There is also some RAM and ROM on a real-time clock 
card. (This clock is only used to set the time when the system boots.) 
The systems have a 40 MByte hard disk and a floppy disk drive (not 
used during normal operation). The cpu card and disk controller are 
made by PEP Modular Computers. There are some in-house I/O cards 
and a VME bus status card. The systems are connected to the rest of 
the world through a Mil-1553B line. 

Any of the components might of course break down and cause problems,
but why should it happen on three systems at the same time? And how 
could a reboot solve this problem?

The other systems that do not have the same problems are both less 
heavily loaded and have been running continuously for a shorter time 
period.


This is the best description of our problem that I am able to come up
with.  We would be most thankful if you have any ideas about what we could
look for (and how to look without rebooting).  We would also like to hear
from other groups who have had their OS-9 systems running for a similar
period and who have/have not seen the same problem.

Thank you in advance.

--
	     Herbert Walseth, herbert@cernvax.cern.ch
	     TIS Division, CERN, CH-1211 Geneva, Switzerland
	     Phone: +41 22 767 2634, Fax +41 22 785 2208

   No problem is so big or so complicated that it can't be run away from