Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!spool.mu.edu!munnari.oz.au!metro!usage.csd.unsw.oz.au!newt.phys.unsw.OZ.AU!mcba
From: mcba@newt.phys.unsw.OZ.AU (Michael C. B. Ashley)
Newsgroups: comp.unix.ultrix
Subject: help: random crashing of DS5000 running ULTRIX 4.1
Message-ID: <1606@usage.csd.unsw.oz.au>
Date: 27 May 91 00:22:33 GMT
Article-I.D.: usage.1606
Sender: news@usage.csd.unsw.oz.au
Lines: 53

Hi,

This is a rather long message describing a problem I have with a
machine crashing. If anyone could shed some light on a possible
solution, I would be most grateful.

I have a DS5000/200PX running ULTRIX 4.1 (Rev. 52), and the machine
crashes about once a day on average. The symptoms of the crash are that
the system does not respond to keyboard entry, or to /etc/ping from
another machine. If the screen saver is activated, the console screen
remains off despite mouse movement or keyboard presses. If the screen
saver is not activated, then the screen remains on (no error messages
visible), and the mouse will move the cursor.

No error messages are observed in the output of /etc/uerf. The machine
(and memory and disks) pass every diagnostic in /usr/field, and I have
run the hardware (V5.3) ROM tests for days at a time without picking up
any errors. Last week the system board was replaced; however, the
problem remains.

Once I noticed a message similar to "swap error" appearing in the
Session Manager message area at the instant of a crash. As far as I can
see, my swap space is configured correctly (about 300 MBytes of swap for
48 MBytes of memory).

I have tried rebuilding the kernel a few times with minor changes, all
with no effect. Running /etc/sec/auditd doesn't show up anything unusual
at the time of the crash (although the buffering of auditd would
probably prevent the interesting information from being written to
disk).

The machine will run without crashing if I disconnect the ethernet.
The crashes aren't related to any user's program, since there aren't
any users other than root at the moment.

Needless to say, this is a very frustrating problem. Can anyone suggest
what I should do next? I have two ideas:

(1) Maybe my copy of ULTRIX is corrupt. It came from a TK50, a rather
    unreliable medium in my experience. I have run /etc/stl/fverify to
    try to check the files, and everything appears to be OK, although
    it is difficult to be sure, because the *410.inv files show lots of
    checksum errors, having been overwritten by *411.inv files.

(2) Since the crashes appear to be related to the ethernet, maybe I
    need the "ln*.o kernel fix" that has been mentioned recently with
    respect to using tcpdump and LAT with ULTRIX 4.1. Note that our
    ethernet is teeming with exotic packets from all sorts of machines,
    and regularly crashes a couple of VT1000s we have in the building
    (they die with "illegal opcode 28", despite a recent ROM upgrade,
    but that is another story ...).

Thanks for any suggestions!

Michael Ashley
mcba@newt.phys.unsw.oz.au