Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!mailrus!ncar!ico!rcd
From: rcd@ico.isc.com (Dick Dunn)
Newsgroups: comp.unix.sysv386
Subject: Re: Reliability of (Sys V) file systems on power failure
Summary: this is going to be tough, but let's try it for a moment...
Message-ID: <1990Sep26.192446.22110@ico.isc.com>
Date: 26 Sep 90 19:24:46 GMT
References: <5869@suns302.cel.co.uk> <1990Sep22.041723.1599@pilikia.pegasus.com> <1990Sep26.153108.550@naitc.naitc.com>
Organization: Interactive Systems Corporation, Boulder, CO
Lines: 70

I had opined that you shouldn't see file system damage on a power hit, and
also noted that I hadn't seen damage (beyond files being written during the
hit) for quite a few years.

karl@naitc.naitc.com (Karl Denninger) writes:
> Ok, I've seen filesystem damage of this type, on your Operating System
> (2.0.2), and another employee here has seen the same thing on his copy of
> ISC 2.2.
>
> To put it bluntly, there's something wrong that should be fixed.

This sort of thing is tough to work out without a lot of detail, but since
Karl has said, "the gauntlet has been thrown down" let's see if we can make
some progress on it here. I'm game--I don't want to find out "the hard way"
that there are ways to take major damage from a power failure, so if Karl
has seen it, I'd like to learn from it.

> OK, so why did my /etc/default/boot file get whacked a few months back when
> we had a power failure?
...
> (For the unknowing, lacking an /etc/default/boot file, which is READ ONLY,
> you can't boot the machine!)

"Whacked" is a little too technical for me just yet. Do you mean that it
ended up empty, or missing entirely? After you recreated it and got the
system back up, did anything like the boot file show up in lost+found? Was
the rest of /etc/default OK, or did it take out the whole directory?

Here's what I'm trying to get at: If the file was corrupted or gone,
something got written that shouldn't have been written.
The first task is to find out what got written. The sort of reasoning goes
like this:
    - If /etc/default (the directory containing boot) got corrupted, I'd
      want to know what ended up there, because that directory shouldn't
      be subject to change during "normal" system operation.
    - If the inode for boot got corrupted, you'd expect a chunk of inodes
      (one disk sector) to get it...and it's likely that other files would
      be hit also. The boot parameter file is likely to share its inode
      sector with other files that are "important" but seldom modified.
      An access-time update could have been in progress when the power
      failed. If it toasted a full sector, you'd expect to see other
      important files damaged or gone.

> Host adapter was a Adaptec 1542B, disk a Maxtor (which has power-safe logic
> that disables the write gate when power goes out of safe margins).

Sounds good so far. What's the box? If you've built it up from parts,
then what's the motherboard?

As you can guess, I don't yet see cause to say that either hardware or
software is either guilty or innocent. Again, if something got corrupted,
it means that something got written that shouldn't have been written. The
problem--and it's NOT likely to be an easy one--is to find out what was
written wrong. That's likely to give a clue whether it's hardware or
software (or a conspiracy of the two:-).

> Ok Mr. Dunn, the gauntlet has been thrown down. If you want details of the
> failures we have had with YOUR OS (btw, SunOS4.1 doesn't seem to take these
> hits) you're welcome to call me here.

I don't follow the connection to SunOS4.1--correct me if I'm wrong, but I
didn't think there was a hardware platform common to ISC's Sys V.3.2 and
SunOS. (386i???)

(My OS??? Let's clarify: I do use ISC systems, both at work and at home.
I'm taking an interest in this because I want to know how and why the
failures you've seen can happen--it's an important question. But I'm not
speaking for ISC on the net.)

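As a rough sketch of the inode-sector reasoning above: assuming the classic
S5 layout of 64-byte on-disk inodes and 512-byte sectors (so 8 inodes per
sector, numbered from 1), the Python fragment below lists which inode
numbers would go down together if one whole sector were trashed. The
function names, the geometry constants, and the sample inode numbers in the
note after it are all made up for illustration; none of them come from
Karl's machine.

```python
# Sketch only: the geometry is assumed from the classic System V (S5)
# layout, not read from any real disk.
INODE_SIZE = 64      # bytes per on-disk inode (assumption)
SECTOR_SIZE = 512    # bytes per disk sector (assumption)
INODES_PER_SECTOR = SECTOR_SIZE // INODE_SIZE  # 8 under these assumptions


def inode_sector_index(inum):
    """Return the 0-based sector, within the inode list, that holds inum."""
    if inum < 1:
        raise ValueError("inode numbers start at 1")
    return (inum - 1) // INODES_PER_SECTOR


def sector_mates(inum):
    """Return every inode number stored in the same sector as inum."""
    first = inode_sector_index(inum) * INODES_PER_SECTOR + 1
    return list(range(first, first + INODES_PER_SECTOR))


def at_risk(target, inode_of):
    """List the other files whose inodes share a sector with target's inode.

    inode_of maps path -> inode number (hypothetical input, for example
    collected by hand from `ls -i`).
    """
    mates = set(sector_mates(inode_of[target]))
    return sorted(
        path for path, inum in inode_of.items()
        if path != target and inum in mates
    )
```

With a hypothetical map such as {"/etc/default/boot": 21, "/etc/inittab": 18,
"/etc/passwd": 40}, at_risk("/etc/default/boot", ...) would name only
/etc/inittab, since inodes 17 through 24 share a sector under these
assumptions. On a real system the inode numbers would come from `ls -i`, and
those are the files to check for damage alongside the boot file.
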
--
Dick Dunn     rcd@ico.isc.com -or- ico!rcd       Boulder, CO   (303)449-2870
   ...Worst-case analysis must never begin with "No one would ever want..."