Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 alpha 4/15/85; site mm730.uq.OZ
Path: utzoo!linus!philabs!cmcl2!seismo!munnari!basser!uqcspe!mm730!probe
From: probe@mm730.uq.OZ (Cameron Davidson)
Newsgroups: net.unix-wizards,net.bugs.4bsd
Subject: UDA50 and bad blocks (and a bug in dump - 4.2BSD)
Message-ID: <114@mm730.uq.OZ>
Date: Thu, 1-Aug-85 04:01:15 EDT
Article-I.D.: mm730.114
Posted: Thu Aug 1 04:01:15 1985
Date-Received: Wed, 7-Aug-85 00:48:54 EDT
Organization: Mining&Metal. Eng; Univ of Qld; Brisbane; Aus
Lines: 82
Keywords: UDA50 bad-blocks dump(8)
Xref: linus net.unix-wizards:11385 net.bugs.4bsd:1358

This was prompted by arnold@gatech - message <655@gatech.CSNET>, except that it has nothing to do with Ultrix. The experience on our machine (vax730/4.2BSD/RA80) indicates that there may be as many problems in automatically forwarding newly found bad-blocks as there would be avoided by it.

As I understand it the UDA-50 controller transparently does bad-block forwarding PROVIDED the blocks were flagged the last time the disc surfaces were formatted. The problem occurs when you start to get previously good sectors reporting hard errors - even though you cannot read the contents of the sector it might not really be a bad block.

We recently had a problem on an RA-80 with gradually increasing intensity of soft and then hard errors. The DEC service engineers just said well, if you ran VMS... Each file with a hard error that we found we just hid away and wondered what would happen if it got worse. Then of course a hard error occurred in the inode area of the user filesystem and that was the end of our complacency. While we were messing around trying to get a list of all the bad blocks the root filesystem fell over and a bit later so did the swap area when I was trying to boot the mini-root fs from tape.
Then we found out that DEC diagnostics cannot just read a disc to find errors; they must first write known data... well, the last backup was fairly recent. The diagnostics laboured happily over the night and reported about 16 sectors which did not write/read correctly, most with more than one offence, and most associated with a specific head. However, of the sectors which UN*X had previously called unreadable, only one appeared as a hard error and two were reporting soft (ECC correctable) errors. Reformatting the disc and adding the bad sector info reported 20 sectors revectored, and then retesting the disc gave a similar number of fresh bad blocks.

The problem turned out to be the read/write amplifier board - there was nothing wrong with the head/disc assembly.

Lessons: (well they were new to me)

1. BUG IN DUMP: it reads inodes in 8k chunks - fine... but if one sector out of the 16 is unreadable you've lost the lot. By that stage it is probably impossible to recompile dump with a smaller block size.

2. If the software added bad blocks to the hardware revectoring table on its own account then there would have been a race in our case to see whether we first filled up the bad-block table with not-really-bad blocks or clobbered one of the inode blocks.

No operating system can survive having its directory structures corrupted (even, I am told, VMS) and if that happens there is nothing to do but a dump/reformat/restore. Until that occurs, and if the errors are in file data areas only, it is a fairly simple matter to allocate sectors with hard errors to dummy files that can be ignored.

The main occasion on which it would be nice to be able to add bad blocks to the re-vector table would be if they were in the paging area. If the DEC diagnostics are able to reformat just a given range of cylinders then this would be enough (I can't remember - but certainly the exerciser program can check any given fraction of the disc).
Failing this we only need a standalone program to add bad blocks to the table, but I don't suppose DEC are too keen to give out the necessary info.

The difficulty inherent in any automatic bad-block table rewriting lies in judging when the unreliability of a given sector becomes intolerable; certainly a single instance of failure which is cured by rewriting it should not be sufficient. This leads to variable criteria depending on the location within the disc partition. I would suggest that the simplest solution to implement and to use would be a user program allowing manual entry of a block into the re-vector table (all volunteers one step forward please).

3. Reliable file systems? We may have umpteen cloned superblocks in 4.2BSD but for a reliable system we would also need duplicated inodes. Try mounting a filesystem with unreadable inodes and see what happens.

4. How do you tell which block is giving the hard error? The "sec no" reported by the error message is actually the STARTING sector number for the transfer (usually multiple sector). It is the "hdr" that reports the real disc sector that went bad.

5. DEC diagnostics cannot report which sectors are currently unreadable (e.g. with too many bit errors for ECC). If anybody wants it I now have a trivial program which reads the disc and reports head, cylinder, sector etc of unreadable sectors. (DEC doc. didn't tell me about the half cylinder offset on every second cylinder)

Cameron Davidson
ACSnet or CSNET: probe@mm730.uq.oz
UUCP: ...seismo!munnari!mm730.uq.oz!probe
ARPA: probe%mm730.uq.oz@seismo
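
Lessons 1 and 5 both come down to the same idea: read in large chunks for speed, but drop to single sectors when a chunk fails, so that one unreadable sector costs one sector rather than the whole 8k. What follows is only a rough Python sketch of that idea. It is not the program mentioned in lesson 5 and it is not dump's code. The names read_with_fallback, read_at and to_chs are hypothetical, the 512-byte sector size is an assumption, and to_chs is plain arithmetic that knows nothing about revectored sectors or the half cylinder offset.

```python
SECTOR_SIZE = 512  # assumed sector size in bytes


def read_with_fallback(read_at, first_sector, nsectors, sector_size=SECTOR_SIZE):
    """Read nsectors starting at first_sector.

    read_at(offset, length) is a caller-supplied (hypothetical) function that
    returns bytes or raises OSError on a hard error.

    Returns (data, bad): data is always nsectors * sector_size bytes long,
    with unreadable sectors zero-filled; bad lists their sector numbers.
    """
    want = nsectors * sector_size
    try:
        data = read_at(first_sector * sector_size, want)
        if len(data) == want:
            return data, []          # whole chunk came back cleanly
    except OSError:
        pass                         # fall through to the slow path

    # Slow path: one sector at a time, so a single bad sector
    # no longer takes its neighbours down with it.
    parts, bad = [], []
    for sector in range(first_sector, first_sector + nsectors):
        try:
            part = read_at(sector * sector_size, sector_size)
        except OSError:
            part = b""
        if len(part) != sector_size:
            bad.append(sector)
            part = bytes(sector_size)  # zero-fill the hole
        parts.append(part)
    return b"".join(parts), bad


def to_chs(sector, sectors_per_track, tracks_per_cylinder):
    """Split a linear sector number into (cylinder, head, sector).

    Pure arithmetic on an assumed regular geometry; it does not model
    revectoring or any per-cylinder offset.
    """
    cylinder, rest = divmod(sector, sectors_per_track * tracks_per_cylinder)
    head, sec = divmod(rest, sectors_per_track)
    return cylinder, head, sec
```

On a real system read_at could wrap os.pread on the raw device, and the geometry arguments would have to come from whatever the local disktab entry says. Zero-filling the unreadable sectors is just one choice; a backup program might prefer to record them and skip the affected inodes instead.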