Xref: utzoo comp.unix.questions:25715 comp.unix.sysv386:683
Path: utzoo!attcan!uunet!aplcen!haven!mimsy!chris
From: chris@mimsy.umd.edu (Chris Torek)
Newsgroups: comp.unix.questions,comp.unix.sysv386
Subject: Re: Reliability of System V 1K file system
Keywords: System V, reliability, file systems
Message-ID: <26688@mimsy.umd.edu>
Date: 24 Sep 90 18:00:06 GMT
References: <5869@suns302.cel.co.uk> <1990Sep22.041723.1599@pilikia.pegasus.com> <1990Sep23.184158.841@hq.demos.su>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 68

In article <1990Sep23.184158.841@hq.demos.su> avg@hq.demos.su
(Vadim G. Antonov) writes:
>Practically all machines provide power fail interrupts -

I know of a number that do not; but in any case:

>and I don't know why Unix device drivers have no "xxpwfail" entries.

power failure interrupts are not any good unless they are guaranteed
to occur sufficiently early, and usually they are not. The power
supply system on the main CPU (the thing that has a `power fail'
interrupt) is quite often completely independent of the power supply
for the disk drives. If the electronics on the drive are in an
indeterminate state, nothing done at the CPU will guarantee anything.

>Anyway I'm quite sure *any* device can correctly handle power fails -
>if you handle device properly :-).

Power fail handling is like lightning protection: you can only do so
much; if Nature is out to get you, you are doomed. (Lightning strikes
have been known to dance teasingly around all the grounding posts,
giggle in circles round and round as your hair stands on end, then
viciously zap straight into the heart of your computer. Well, maybe
not quite. :-) )

Designing systems that act properly on power failure is, however,
tricky:

>It seems to me the best way to protect disks from accidental
>damaging by power fails is to start recalibrating or moving
>heads to landing zone - usually quite simple logic circuitry
>protects from writing while heads move.

Let me tell you about ... Century Data Systems T-300s. (To be fair,
I am not sure where the problem was located. The T300s were merely
the end of the chain.)

We have a couple of Xerox file servers with big washtub drives. These
drives have a power fail system that retracts the heads (quite
reasonably) so that they will not land on the disk when it stops
spinning. Apparently it turns off the write current at the same time,
because a simple power failure does not damage anything.

Unfortunately, there are not-so-simple power failures. Thunderstorms
(those things that Californians never see :-) ) often cause momentary
power failures---anywhere from a fraction of a second to several
seconds. (Power distribution systems have things called `lightning
arrestors' that temporarily open the circuit to prevent serious
overvoltages. There are two major variants, air and oil. Lightning
will jump a simple air gap so the air versions blow compressed gas
across the gap. I know nothing about the oil versions, other than
that they explode very prettily, like oil-filled transformers. :-) )

Anyway, as it happens, under certain conditions the T300s would detect
a power failure and begin retracting the heads. Then the power would
come back on, the electronics would think, `oh, everything is OK', and
the write current would turn on---while the heads were still spiraling
down the pack. The result was invariably a hopelessly damaged pack.
Hundreds of `bad' sectors appeared in a spiral pattern, and the only
means of recovery was to reformat (followed by a tedious restore from
the backup server, all the while hoping desperately that another storm
would not come up during the multi-hour restore).

The problem has finally been fixed: the file servers are now on a
ten-minute UPS, and only a long-term power failure---the kind the
drives were engineered to handle---will get through.
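
For the curious, the T300 failure above looks like a write interlock
that watches the power-good signal and nothing else. What follows is
only a sketch of that guess: the struct, its fields, and both function
names are hypothetical, invented for illustration, and are not taken
from any drive documentation. As noted, I am not sure where the real
problem was located.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the drive state; these names are assumptions. */
struct drive {
    bool power_good;   /* supply is currently within tolerance */
    bool retracting;   /* heads are still moving toward the landing zone */
};

/* Flawed interlock: write current follows the power-good signal alone. */
bool write_enabled_naive(const struct drive *d)
{
    return d->power_good;
}

/* Safer interlock: also refuse to write until the retract has finished. */
bool write_enabled_safe(const struct drive *d)
{
    return d->power_good && !d->retracting;
}

/* Momentary failure: power has returned but the heads are still moving. */
void check_momentary_failure(void)
{
    struct drive d = { true, true };

    assert(write_enabled_naive(&d));   /* writes resume mid-retract */
    assert(!write_enabled_safe(&d));   /* writes stay off until settled */
}
```

A simple (long) power failure never shows the difference, since
power_good stays false until well after the heads have parked. Only
the momentary failure separates the two versions.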
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain: chris@cs.umd.edu    Path: uunet!mimsy!chris