Path: utzoo!attcan!uunet!seismo!sundc!pitstop!sun!decwrl!purdue!mailrus!wasatch!cons.utah.edu!kessler
From: kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler)
Newsgroups: comp.unix.xenix
Subject: SCO Xenix System Hang
Message-ID: <766@wasatch.UUCP>
Date: 10 Dec 88 22:43:11 GMT
Sender: news@wasatch.UUCP
Reply-To: kessler%cons.utah.edu@wasatch.UUCP (Robert R. Kessler)
Organization: University of Utah, Computer Science Dept.
Lines: 80

We are having problems with SCO 386 Xenix and are looking for some help. Here is the scenario:

Our customer is running on an IBM PS/2 Model 80, 20 MHz, with a Hostess multiport board. They run about 6 concurrent users. We have installed the latest version of Xenix (2.2.? -- I don't recall which exactly). Our applications are all written in RM/COBOL supported by Austec.

Our customer arrives in the morning around 6 am and starts using a terminal or two. By 9 am they are up to full strength running all 6 terminals. When users run our software, they all log in with the same user id, which starts executing our own user interface shell (written in COBOL). It emulates the user interface that we had on our original system running on TI minis. The user then selects the program to run and our shell uses the COBOL CALL statement to call the program. I don't believe that it actually forks a process to do this, though I might be wrong. Typically, a user seldom logs all the way out and just goes in and out of programs from our shell.

Some time after 11:00 am, the display on the various screens starts to slow down. Instead of blasting the fields to the screen, it chunks out a line at a time. If someone can get to a terminal with the login prompt, they may be able to log in as shutdown and reboot the system. All user programs can usually save their data (the applications are all doing data base like operations, using the key-indexed files provided by COBOL -- we don't use any external data base facility). However, they cannot exit back out of the programs.
They just hang. If they are successful at rebooting the system, everyone comes back up and it works just fine. If not, then the whole system hangs.

The hang is interesting. If you have a terminal sitting at the regular sh prompt, you can type carriage returns and the prompt is echoed. If you run any command (ps, shutdown, etc.) then it just goes away and doesn't respond. You can still type on the terminal and characters are echoed, but nothing happens. You can also switch to a different screen and bring up the system prompt. However, if you try to type on this screen, nothing is echoed.

It seems to be related to the amount of work that gets done. If they then go for another three hours, it will crap out again. Our customers are currently rebooting at 11, 2 and 5 before going home and running their nightly backup and upkeep programs. It is extremely inconvenient. We have three other customers waiting for their systems, but we don't want to send them until we can get this problem fixed. We have contacted SCO a couple of times, and Austec, but haven't had any resolution of the problem (there seems to be some finger pointing between the two).

Another data point -- I tried to simulate the problem and wrote a program to CALL a couple of programs and exit, etc. I eventually did get code that would always cause the system to hang in exactly the same way (but it doesn't need to do any calls). However, I tracked down the problem and it is some kind of record/file locking problem. The program that eventually causes the hang essentially opens and writes to a shared file. It hangs randomly, when one terminal opens or closes the file and another writes to it. We guarantee that they don't write concurrently to the same records, but even so it shouldn't get into a situation where it hangs the entire system. The resulting hang acts just like the hang at our customer's site. However, this hang can happen in one minute or in 3 hours. It is entirely timing dependent, not load dependent.
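
For illustration only, the access pattern described above (one process repeatedly opening and closing a shared file while another writes records to it under a record lock) might be sketched as follows. This is Python, not RM/COBOL, and it is not the original test program. The record length, file name, and function names are hypothetical, and it is an assumption that the COBOL runtime maps its record locks onto byte-range locks of this kind.

```python
import fcntl
import os
import tempfile

RECORD_SIZE = 64  # hypothetical fixed record length, in bytes


def write_record(path, index, payload):
    """Lock one record's byte range, overwrite it, then unlock."""
    data = payload.ljust(RECORD_SIZE, b" ")[:RECORD_SIZE]
    offset = index * RECORD_SIZE
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Exclusive lock covering this record only; blocks until granted.
        fcntl.lockf(fd, fcntl.LOCK_EX, RECORD_SIZE, offset, os.SEEK_SET)
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, data)
        fcntl.lockf(fd, fcntl.LOCK_UN, RECORD_SIZE, offset, os.SEEK_SET)
    finally:
        os.close(fd)


def read_record(path, index):
    """Return the raw bytes of one record."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_SIZE)
        return f.read(RECORD_SIZE)


def open_close_loop(path, iterations):
    """Stand-in for a terminal that only opens and closes the shared file."""
    for _ in range(iterations):
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        os.close(fd)


def run_trial(path, iterations=200):
    """Run the opener in a child process while the parent writes records.

    Returns True if the child exited cleanly.
    """
    pid = os.fork()
    if pid == 0:
        # Child process: opens and closes only, never writes.
        try:
            open_close_loop(path, iterations)
        finally:
            os._exit(0)
    # Parent process: cycles through four records, one lock at a time.
    for i in range(iterations):
        write_record(path, i % 4, b"rec %d" % i)
    _, status = os.waitpid(pid, 0)
    return status == 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shared = os.path.join(tmp, "shared.dat")
        print("trial completed:", run_trial(shared))
```

On a correctly working system a trial like this should simply run to completion. The point of the sketch is only to make the open/close versus locked-write interleaving concrete, so that the same pattern can be tried in a small program outside the application.
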
I believe that my program uncovers another bug, and really isn't what our user is seeing (I tried rewriting the program so the file isn't shared and installed it at the customer -- it didn't help, since we still have lots of shared files that are used in the system). Plus, the fact that my hang is timing dependent makes me believe that it is a different problem, though the result is the same. (BTW -- the buggy program was run on a COMPAQ DeskPro 386/20 running 2.3.1 Xenix.)

Any help would be greatly appreciated. Can I write some logging programs to write useful information to a file that we could examine after a crash? Is there some system parameter that I could tweak to alleviate it?

Thanks.
Bob.