Xref: utzoo comp.os.os9:1053 comp.realtime:725
Path: utzoo!attcan!uunet!midway!ncar!asuvax!cs.utexas.edu!usc!apple!snorkelwacker!bloom-beacon!eru!luth!sunic!mcsun!cernvax!herbert
From: herbert@cernvax.UUCP (herbert walseth)
Newsgroups: comp.os.os9,comp.realtime
Subject: Time Problem on OS-9 Systems
Summary: RTC slows down after long period of operation
Message-ID: <2031@cernvax.UUCP>
Date: 3 Jul 90 09:02:26 GMT
Followup-To: comp.os.os9
Organization: CERN, European Laboratory for Particle Physics
Lines: 112

Help! After more than four months of uninterrupted operation, the real-time clock on our OS-9 systems is suddenly causing us problems. Below is an attempt to describe the situation. We are quite stuck and would greatly appreciate any advice from the net. I am cross-posting this to comp.realtime and comp.os.os9; both general and OS-9-specific comments are welcome.

[Sorry about the length of the posting, but I'm trying to include all relevant information, and that is not easy when you don't know where the problem lies.]

The systems read the time from a real-time clock card when they boot. From then on it is up to the OS-9 software to keep the time correct. This worked fine and the time was stable for months. But during the last two weeks, the clocks on three of the systems have suddenly started to run slow. They are all steadily losing approximately ten minutes per day.

There are several obvious possible causes. We did have a similar problem on one of our systems before, and that was due to a hardware fault on the CPU card. But we find it most unlikely that three cards should break down at almost the same time, more than a year after installation. The malfunction might also be caused by problems with the power supply: the area went through a period of heavy thunderstorms some time ago. But our systems are powered from LARGE batteries that should filter out all spikes from the mains. Sudden changes in temperature etc.
are also quite unlikely 100 meters underground. The three systems stand several kilometers away from each other and are not directly connected in any way. They have been doing exactly the same tasks during this period, and no operational changes that could explain the situation have been made. We are not moving our VME crates around at high speeds.

After eliminating all the more or less obvious faults that we could think of, we went on to the less likely ones. The only possibility we have left, and one we would be very interested in getting some feedback on, is that the long uninterrupted period of operation may have caused an overflow somewhere in the OS-9 software, and that this might cause the clock to slow down.

All systems have the same software installed. There are around 25 processes running concurrently, and although they are sleeping most of the time, some of them have accumulated 100+ CPU hours and made some hundred million calls to the kernel. There are also one or two short-lived processes forked every second. This should have added up to around 15-20 million forks by now. Our installed systems are running OS-9 Version 2.2.

What really made us suspect the software is the following: During a shutdown of the accelerator we rebooted one of the systems, used setime to correct the time on the second, and left the last one untouched. After this, the one that was rebooted started to run normally, while the two others continued to lose time.

We find it quite hard to debug this problem. Obviously we cannot plug in a logic analyser etc., since a reboot will cure (hide) the problem. But the systems have a floppy drive, so we can install test programs. I have checked system global variables like ticks per second, ticks since reboot etc., and they all look OK.

What really confused me is the following experiment: I made a small program that reads the ticks since reboot and seconds left until midnight.
It then sleeps for 0x144510 ticks, or 3 hours, and calculates the number of ticks and seconds that have passed while it was asleep. On all systems, the expected 0x144510 ticks have passed, but the number of seconds that have passed according to the D_Seconds variable differs between systems. The amazing thing is that on the systems where the clock is running incorrectly, the elapsed time is 0x2a30 (10800) seconds, or 3 hours, as it should be. On the systems where the clock is correct, however, 0x2a82 (10882) seconds have passed!

Here is a description of the hardware: The CPU board has a 68020 CPU, a 68881 FPCP, 1 MByte of SRAM and 512 KByte of EPROM. There is also some RAM and ROM on a real-time clock card. (This clock is only used to set the time when the system boots.) The systems have a 40 MByte hard disk and a floppy disk drive (not used during normal operation). The CPU card and disk controller are made by PEP Modular Computers. There are some I/O cards made in-house and a VME bus status card. The systems are connected to the rest of the world through a MIL-1553B line.

Any of the components might of course break down and cause problems, but why should it happen on three systems at the same time? And how could a reboot solve the problem? The other systems, which do not show the problem, are both less heavily loaded and have been running continuously for a shorter period.

This is the best description of our problem that I am able to come up with. We would be most thankful for any ideas about what we could look for (and how to look without rebooting). We would also like to hear from other groups who have had their OS-9 systems running for a similar period and who have or have not seen the same problem.

Thank you in advance.

-- 
Herbert Walseth, herbert@cernvax.cern.ch
TIS Division, CERN, CH-1211 Geneva, Switzerland
Phone: +41 22 767 2634, Fax +41 22 785 2208
No problem is so big or so complicated that it can't be run away from
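
As a cross-check of the figures quoted in the sleep experiment, here is a minimal arithmetic sketch in Python. It is not the test program that ran on the OS-9 systems; the constant and function names are made up for illustration, and only the three hexadecimal values come from the measurements above. It suggests that the sleep length corresponds to 123 ticks per second, and that the 82-second difference per three-hour sleep scales to roughly eleven minutes per day, which is close to the observed loss of about ten minutes per day.

```python
# Figures taken from the experiment described in the post.
TICKS_SLEPT = 0x144510         # ticks slept by the test program
NOMINAL_SECONDS = 0x2A30       # 3 hours, seen on the systems losing time
SECONDS_GOOD_CLOCK = 0x2A82    # seen on the systems showing correct time

SECONDS_PER_DAY = 24 * 60 * 60


def ticks_per_second(ticks, seconds):
    """Average number of ticks counted per second of the system clock."""
    return ticks / seconds


def drift_minutes_per_day(seconds_a, seconds_b, interval_seconds):
    """Scale the difference between two second counts, measured over
    one interval, to minutes per day."""
    per_day = (seconds_a - seconds_b) * (SECONDS_PER_DAY / interval_seconds)
    return per_day / 60


nominal_rate = ticks_per_second(TICKS_SLEPT, NOMINAL_SECONDS)
good_rate = ticks_per_second(TICKS_SLEPT, SECONDS_GOOD_CLOCK)

# Scaled with the nominal 3-hour interval; using 0x2A82 instead
# changes the result only slightly.
drift = drift_minutes_per_day(SECONDS_GOOD_CLOCK, NOMINAL_SECONDS,
                              NOMINAL_SECONDS)

print(f"ticks/s implied by 0x2a30 seconds: {nominal_rate:.3f}")
print(f"ticks/s implied by 0x2a82 seconds: {good_rate:.3f}")
print(f"difference per sleep: {SECONDS_GOOD_CLOCK - NOMINAL_SECONDS} s")
print(f"scaled difference: {drift:.1f} minutes per day")
```

The sketch only restates the measurements; it does not say which of the two second counts matches true elapsed time, nor where in the software the difference arises.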