Path: utzoo!utgpu!water!watmath!clyde!ima!cfisun!rich
From: rich@cfi.COM (rich)
Newsgroups: comp.bugs.4bsd
Subject: Bus error crash during dumps
Keywords: Sun, server, crash
Message-ID: <57@mars.UUCP>
Date: 13 Jan 88 20:09:02 GMT
Organization: Consumer Financial Institute, Waltham, Mass.
Lines: 69

We have recently encountered a bug in the kernel, apparently documented by Sun in the May 1987 Software Technical Bulletin, that causes our file server (a 3/280) to crash with a bus error while running a full system dump using a third-party tape backup system. I am looking for more information about the bug; in particular, what causes it and how we might get around it.

Now for the details. We have a Sun network with two servers (a 3/180 with a single Eagle disk and a 3/280 with two Super Eagles), five 3/160s (dual 70MB disks), a 3/60 (single 70MB disk), and three 3/50s (diskless). They are all running SunOS 3.4.

We perform system backups using the UBACKUP package from Unitech Software (a nice package, by the way). The package uses the Sun-supplied tar (it can also use cpio) at the lowest level to perform the dumps (a full dump on the weekend, incrementals during the week). We dump the entire network (with some excluded directories) through NFS mounts.

This backup system worked fine for about 8 months. About a month ago, we decided to change the way we access remote machines, reducing the mount list from over 80 entries to about 15 and increasing the use of symbolic links. After making this change (no physical files were moved), we tried to perform a full dump, and we started getting the bus error.

After Sun took a look at our core dump, they determined that we were encountering a reported bug: Ref# 1004002 in the May '87 STB (page 134). They also said that the bug has been fixed in 4.0. The synopsis is: "*crfreelist in kern_prot.c gets trashed."
The description is: "When doing extensive ethernet/disk activity (time of occurrence ranges from 2 to 12 hours) the system may trap on a bus error condition."

The crash usually occurs near the beginning of the third tape (sure enough, about 2 1/2 hours into the dump), but not always at the same place in the file system. It does not crash when dumping individual machines (e.g., /remote//u and /remote/