Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!unmvax!ariel.unm.edu!triton.unm.edu!van
From: van@triton.unm.edu (Van Rauch)
Newsgroups: comp.unix.ultrix
Subject: Re: 5000/200 HANGS intermittently. No error messages. ugh. help.
Message-ID: <1990Oct18.213749.29975@ariel.unm.edu>
Date: 18 Oct 90 21:37:49 GMT
References: <1990Oct16.211229.18767@ariel.unm.edu>
Sender: news@ariel.unm.edu (USENET News System)
Organization: University of New Mexico, Albuquerque NM
Lines: 95

In article <1990Oct16.211229.18767@ariel.unm.edu> van@triton.unm.edu (Van Rauch) writes:
>
>very strange problem with our 5000.  About every 3 to 7 days, 
>the system will HANG.  No messages on the console, nothing in 
>syserr.hostname.# (uerf), no core file in /usr/adm/crash (savecore is 
>turned on). - nuthin!
>

I should have rtfm'd before I posted. There exists a doc
called:

"Starting the Crash Dump Routine Manually on RISC Processors"
in volume 3 of System and Network Management.

As far as I can tell there is a bug in the 4.0 kernel 
that innocuous user and system processes are tripping over.

After a few hours with crash vmcore.# vmunix.#, a trace of the
processes that were runnable at the time of the crash shows
that a different process each time ends up executing the panic
and boot routines, for example:

> proc -r
SLT S   PID  PPID  PGRP  UID  PY CPU   SIGS    EVENT FLAGS
...
 80 r  4324  3999  4324 7341 113 255      0              in trace pagi
...

> trace 80
Stack trace -- last called first
   0 boot (paniced = 0, arghowto = 0) [../../machine/mips/machdep.c: ,545 0x80109ea8]
   1 panic (s = 80159828) [../../sys/subr_prf.c: ,1159 0x800a3c18]
   2 kn02trap_error (ep = ffffdcf8, code = 80112fcc, sr = 0008, signo = ffffdcd4
...
> ps 80
SLOT   PID   UID   COMMAND
  80  4324  7341    (sml)
>

where "sml" is a program made available to students for a cs class.

The $60,000 question is: how does one get the text string for
the argument to panic, e.g. panic (s = 80159828)? Or, more
plainly, where do I go from here?

The consensus here is that without adb, one can't get it.
Does anyone know differently? 
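
One untested idea, sketched here in modern POSIX sh rather than csh:
on MIPS, an address such as 80159828 lies in kseg0, which maps to
physical memory by subtracting 0x80000000. ASSUMING vmcore.# is a raw
image of physical memory starting at address 0 (I have not confirmed
this for 4.0), the string could be read straight out of the dump.
The function names kseg0_offset and read_kstring are made up for
illustration.

```shell
# Hypothetical helper: turn a kseg0 kernel address (hex digits, no 0x
# prefix) into a byte offset, on the assumption that the dump file is
# a raw image of physical memory starting at physical address 0.
kseg0_offset() {
    echo $(( 0x$1 - 0x80000000 ))
}

# Hypothetical helper: print the NUL-terminated string found at a
# kernel address.  $1 = dump file, $2 = hex address.  Only the first
# 80 bytes are examined.
read_kstring() {
    dd if="$1" bs=1 skip="$(kseg0_offset "$2")" count=80 2>/dev/null |
        tr '\000' '\n' | head -n 1
}
```

For the trace above that would be "read_kstring vmcore.# 80159828",
with the real dump file name filled in. If the assumption about the
dump layout is wrong, the output will be garbage.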

Each time our 5000 has hung, a different process has led to the
panic and boot, i.e. there is no consistency at the csh level in
which command trips the ?kernel? bug. Without more
help from /bin/crash I'm at a loss for how to find
the instruction that does the damage.

---

And now for something completely different...

cmp behaves differently under 4.0

Given two files, foo1 and foo2; foo1 is NONempty and foo2 is empty.

And the script, "cmp.csh":

#! /bin/csh
set x = `cmp foo1 foo2`
echo $x
echo $x[1]

---
under 3.x:
----
fornax.unm.edu:van -> cmp.csh
cmp: EOF on foo2
cmp:

---
under 4.0:
----
triton.unm.edu:van -> cmp.csh
cmp: EOF on foo2
Subscript out of range.

This happens because cmp under 4.0 was changed to
write EOF diagnostics to stderr instead of stdout. Under
3.x EOF diagnostics are written to stdout. With nothing on
stdout under 4.0, $x is empty and $x[1] is out of range.
Yes, I'm splitting hairs here, but
when your favorite prof comes to you pulling his/her
hair out because a homegrown script breaks on the "new"
system, it makes you appreciate consistency ;-)
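
A script that should not care which stream cmp uses can either merge
the two streams or ignore the message and test the exit status. A
minimal sketch in Bourne-style sh rather than csh; the function names
cmp_message and files_same are made up for illustration:

```shell
# Hypothetical helper: print whatever cmp has to say about two files,
# whether the diagnostic goes to stdout (3.x) or stderr (4.0).
cmp_message() {
    cmp "$1" "$2" 2>&1
}

# Hypothetical helper: succeed only if the two files are identical.
# cmp -s prints nothing, so only the exit status matters.
files_same() {
    cmp -s "$1" "$2"
}
```

With these, "x=`cmp_message foo1 foo2`" captures the EOF message on
either release, and "if files_same foo1 foo2" avoids parsing the
message at all.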
---
Van Rauch			van@triton.unm.edu
Application/Systems
University of NM, CIRT