Path: utzoo!utgpu!news-server.csri.toronto.edu!smoke.cs.toronto.edu!moraes
Newsgroups: comp.sys.sgi
From: moraes@cs.toronto.edu (Mark Moraes)
Subject: Re: 4D-200 series hangs frequently
Message-ID: <91Jun29.232126edt.1042@smoke.cs.toronto.edu>
Organization: Department of Computer Science, University of Toronto
References: <84172@bu.edu> <1991Jun21.022810.7255@fido.asd.sgi.com>
Date: 30 Jun 91 03:21:45 GMT
Lines: 46

For what it's worth, we have a 4D/240 and a 4D/340 that hang
frequently too. (frequently == every day at the worst of times, once
every four days at the best; we've been waiting for them to pass the
canonical "can it stay up and working for 30 days" test for over a
year now :-) (Took our Sun4/280 a year and a half to reach this
blissful state, so this must be something about modern OSes)

Both systems run 3.3.1, neither has a graphics console. Both have
their console serial lines wired to a Develcon Develswitch, so we can
get at them remotely when need be. Both act as fileservers for
diskless Sun3s, have users login from our terminal server, and from X
terminals or workstations.

The 240 runs a non-standard Ciprico disk controller and driver, so
it's possible that the problems on it are our fault. However, the 340
is standard SGI hardware and software, just a few kernel constants
(like streams buffers) cranked up (and a couple of streams code fixes
that solved some of the more frequent hangs)

Typical hang conditions are when the system has a dozen users or so --
the 340 usually has all four processors busy with crunch jobs in the
background, the 240 usually has a couple of processors idle. Both
systems are frequently pushed to the limit, the 340 more often than
the 240 (the 240 hangs more often, though) which must be some part of
the problem, because other less loaded 240s and 280s around here have
stayed up for months on end.

Both systems have a lot of NFS hard mounts, including cross-mounts.
We're well aware that an NFS server going down can hang them, but
there have been many hangs that cannot be explained by this. (We also
mount NFS directories in /nfs/machine/filesystem to try to avoid some
of the problems)

I've seen some correlation between hangs and a home directory file
system filling up. Not too conclusive, though. (Both machines have a
reasonable amount of swap -- 200Mb or so; we're well aware that a
process filling up swap can degrade the system impressively as it does
its best to make the process dump core :-)

I know we ought to have reported this in more detail before, but we've
been embarrassingly sluggish about collecting enough facts to make this
sort of report useful to the kernel folk we contact at SGI, and calls
to the hotline about this sort of problem produce, um, less than
helpful answers once we confirm that we have lots of space in /tmp, we
age logs regularly, and have lots of swap space.

	Mark.