Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!unmvax!ariel.unm.edu!triton.unm.edu!van
From: van@triton.unm.edu (Van Rauch)
Newsgroups: comp.unix.ultrix
Subject: Re: 5000/200 HANGS intermittently. No error messages. ugh. help.
Message-ID: <1990Oct18.213749.29975@ariel.unm.edu>
Date: 18 Oct 90 21:37:49 GMT
References: <1990Oct16.211229.18767@ariel.unm.edu>
Sender: news@ariel.unm.edu (USENET News System)
Organization: University of New Mexico, Albuquerque NM
Lines: 95

In article <1990Oct16.211229.18767@ariel.unm.edu> van@triton.unm.edu (Van Rauch) writes:
>
>very strange problem with our 5000. About every 3 to 7 days,
>the system will HANG. No messages on the console, nothing in
>syserr.hostname.# (uerf), no core file in /usr/adm/crash (savecore is
>turned on). - nuthin!
>

I should have rtfm'd before I posted. There is a document called
"Starting the Crash Dump Routine Manually on RISC Processors"
in volume 3 of System and Network Management.

As far as I can tell, there is a bug in the 4.0 kernel that innocuous
user and system processes are tripping over. After spending a few hours
with

    crash vmcore.# vmunix.#

a trace of the runnable processes at the time of the crash shows
different processes that eventually end up executing panic and boot,
for example:

    > proc -r
    SLT S  PID PPID PGRP  UID  PY CPU SIGS EVENT FLAGS
    ...
     80 r 4324 3999 4324 7341 113 255    0       in trace pagi
    ...
    > trace 80
    Stack trace -- last called first
    0 boot (paniced = 0, arghowto = 0) [../../machine/mips/machdep.c: ,545 0x8010 9ea8]
    1 panic (s = 80159828) [../../sys/subr_prf.c: ,1159 0x800a3c18]
    2 kn02trap_error (ep = ffffdcf8, code = 80112fcc, sr = 0008, signo = ffffdcd4
    ...
    > ps 80
    SLOT  PID  UID COMMAND
      80 4324 7341 (sml)
    >

where "sml" is a program made available to students for a cs class.

The $60,000 question is: how does one get the text string for the
argument to panic, e.g. panic (s = 80159828)?
Or more plainly, where do I go from here? The consensus here is that
without adb, one can't get it. Does anyone know differently?

Each time our 5000 has hung, a different process has led to the panic
and boot; i.e., there is no consistency at the csh level in which
command trips the ?kernel? bug. Without more help from /bin/crash, I'm
at a loss for how to find the instruction that does the damage.

---
And now for something completely different...

cmp different under 4.0

Given two files, foo1 and foo2; foo1 is NONempty and foo2 is empty.
And the script, "cmp.csh":

    #! /bin/csh
    set x = `cmp foo1 foo2`
    echo $x
    echo $x[1]

--- under 3.x: ----
    fornax.unm.edu:van -> cmp.csh
    cmp: EOF on foo2
    cmp:

--- under 4.0 ----
    triton.unm.edu:van -> cmp.csh
    cmp: EOF on foo2
    Subscript out of range.

This happens because cmp under 4.0 was changed to write EOF diagnostics
to stderr instead of stdout, so the backquotes capture nothing, x ends
up empty, and $x[1] is out of range. Under 3.x, EOF diagnostics are
written to stdout.

Yes, I'm splitting hairs here, but when your favorite prof comes to you
pulling his/her hair out because their homegrown script breaks on the
"new" system, it makes you appreciate consistency ;-)

---
Van Rauch              van@triton.unm.edu
Application/Systems    University of NM, CIRT
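
P.S. on the cmp change: one way to make a script like cmp.csh
indifferent to which stream carries the EOF diagnostic is to merge
stderr into stdout before capturing it, and to lean on cmp's exit status
rather than its text. What follows is only a minimal sketch, and it is
written in Bourne shell rather than csh; the scratch directory and the
variable names dir, x, status and first are made up for illustration,
and foo1/foo2 simply mirror the two files described above.

```shell
#!/bin/sh
# Sketch only: Bourne shell, not the csh of cmp.csh.
# dir, x, status and first are illustrative names (assumptions);
# foo1 and foo2 mirror the files described in the post.
dir=`mktemp -d` || exit 1
echo "some text" > "$dir/foo1"    # foo1 is nonempty
: > "$dir/foo2"                   # foo2 is empty

# Merge stderr into stdout so the diagnostic is captured whichever
# stream this particular cmp writes it to.
x=`cmp "$dir/foo1" "$dir/foo2" 2>&1`
status=$?                         # 0 = identical, 1 = differ, >1 = trouble

# First word of the captured text, the rough analogue of $x[1] in csh.
set -- $x
first=$1

echo "$x"
echo "$first"
echo "exit status: $status"

rm -rf "$dir"
```

In csh itself, piping with |& (for instance through cat inside the
backquotes) should merge the two streams in the same way, though I have
not tried that on both releases.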