Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!rochester!cornell!batcomputer!rpi!usc!cs.utexas.edu!uunet!munnari.oz.au!manuel!aerodec.anu.edu.au!csc
From: tridge@aerodec.anu.edu.au (Andrew Tridgell)
Newsgroups: comp.unix.ultrix
Subject: really weird filesystem problem
Message-ID: <1005@aerodec.anu.edu.au>
Date: 1 May 91 16:48:18 GMT
Sender: news@newshost.anu.edu.au
Organization: Australian National University, Canberra, ACT, Australia
Lines: 104

I have recently struck a really weird file system problem under
Ultrix 4.0 on a DS3100.

The root of the problem was a "CANNOT READ: BLK 567664" reported by
fsck. This prevented unattended rebooting, as every reboot required
running fsck by hand, which is a real pain. It also meant that whenever
the file at 567664 was accessed I got a kernel panic! For example, an
"ls -l" in the directory would cause a panic.

We have a rz56 disk (600Mb) with /dev/rz1a as root, /dev/rz1g as /usr
and /dev/rz1h as /u. Here is the disktab for a rz56 in case you don't
have one:

rz56|RZ56|DEC RZ56 Winchester:\
        :ty=winchester:ns#54:nt#15:nc#1632:\
        :pa#32768:ba#8192:fa#1024:\
        :pb#131072:bb#4096:fb#1024:\
        :pc#1299174:bc#8192:fc#1024:\
        :pd#292530:bd#8192:fd#1024:\
        :pe#292530:be#8192:fe#1024:\
        :pf#550274:bf#8192:ff#1024:\
        :pg#567666:bg#8192:fg#1024:\
        :ph#567668:bh#8192:fh#1024:

Read on for more weirdness:

The problem started when I moved a few files from /usr to /u. Soon
afterwards I noticed that doing "ls -l" in a certain subdirectory I'd
been moving files from caused a panic (it was a subdirectory of
/usr/include, I think). I moved this directory to /usr/BAD and removed
all read and write permissions from it to prevent users crashing the
system.

Next time we rebooted I got this:

        ** /dev/rz1g
        ** Last Mounted on /usr
        ** Phase 1 - Check Blocks and Sizes
        ** Phase 2 - Check Pathnames

        CANNOT READ: BLK 567664
        CONTINUE?

Continuing did no good. At this stage a check in the errorlog showed
nothing.
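The numbers above can be lined up with a little arithmetic. What follows
is only a sketch in Python (not a tool from the setup described here),
and it rests on two assumptions: that sectors are 512 bytes, and that
the BLK number fsck prints is a sector offset within the partition. The
helper names parse_disktab and locate are made up for illustration. It
parses the disktab entry and reports where block 567664 sits relative to
the end of the g partition and to its last full 8192-byte block. It says
nothing about what caused the read failure.

```python
import re

# disktab entry for the rz56, copied from the post
DISKTAB = r"""rz56|RZ56|DEC RZ56 Winchester:\
        :ty=winchester:ns#54:nt#15:nc#1632:\
        :pa#32768:ba#8192:fa#1024:\
        :pb#131072:bb#4096:fb#1024:\
        :pc#1299174:bc#8192:fc#1024:\
        :pd#292530:bd#8192:fd#1024:\
        :pe#292530:be#8192:fe#1024:\
        :pf#550274:bf#8192:ff#1024:\
        :pg#567666:bg#8192:fg#1024:\
        :ph#567668:bh#8192:fh#1024:
"""

SECTOR_BYTES = 512  # ASSUMPTION: sector size is not stated in the entry


def parse_disktab(entry):
    """Return {capability: int} for every numeric 'name#value' field."""
    return {name: int(value) for name, value in re.findall(r"(\w+)#(\d+)", entry)}


def locate(block, part, caps):
    """Place a sector number relative to the end of partition `part`.

    ASSUMPTION: `block` is a sector offset from the start of the partition.
    """
    size = caps["p" + part]                          # partition size, sectors
    block_sectors = caps["b" + part] // SECTOR_BYTES  # sectors per fs block
    tail_start = (size // block_sectors) * block_sectors
    return {
        "size": size,
        "sectors_before_end": size - block,
        "tail_start": tail_start,           # where the last full block ends
        "tail_sectors": size - tail_start,  # leftover sectors after it
        "in_tail": block >= tail_start,
    }


if __name__ == "__main__":
    caps = parse_disktab(DISKTAB)
    for key, value in locate(567664, "g", caps).items():
        print(key, value)
```

Under those assumptions, block 567664 lies 2 sectors before the end of
the 567666-sector g partition, exactly where the last full 8192-byte
block ends, so an 8192-byte read starting there would run past the end
of the partition.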
I tried using rzdisk to reassign the block, with no luck: rzdisk said
the block was OK and asked if I would like to continue anyway. I tried
it both ways with no change to the situation.

Next I got DEC to lend us another identical, brand new disk. With both
disks attached side by side I did a newfs on the new disk (all 3
partitions), then with the new disk's g partition mounted as /nusr I
did this:

        dump 0f - /usr | (cd /nusr ; restore xf - )

to transfer everything to the new disk. (I repeated this for the a and
h partitions, on /nroot and /nu.)

It didn't work! Everything transferred OK, but fsck reported the same
problem on the new g partition. Thus dump/restore had taken the problem
with it to the new disk. This was a real surprise to me. It showed the
problem was not hardware but was in fact software.

My next step was to use tar to transfer instead of dump/restore,
thinking that tar doesn't save any inode info but only transfers files
and permissions. I did this:

        newfs /dev/rrz3g rz56
        fsck /dev/rz3g          #(which reported no problems)
        (cd /usr ; tar -cf - . ) | ( cd /nusr ; tar -xf - )

Once again the problem was transferred to the new disk! An fsck on the
new disk reported the same error as above. Now I was really confused:
if it was a file system error, then how did tar transfer it? Tar only
knows about things like filenames, ownership and permissions, yet fsck
reported the same block number (567664).

Now I got really desperate. I ran newfs with the -v option on
/dev/rrz3g so I could see how it was calling mkfs, then I ran mkfs
manually with 6 fewer sectors: I changed the second parameter in the
call to mkfs from 567666 to 567660. The bad block is thus excluded from
the file system. I then used tar as above to try yet again to transfer
the stuff.

Success! fsck reports no problems on the new g partition!

BUT (there has to be a but) I now have an inconsistency between the
partition table and the file system.
The partition table thinks there are 567666 sectors; the file system
thinks there are 567660. Could this cause a problem in the future? I
asked DEC support and they don't think so, but maybe...

Basically my questions to the net are:

- what caused the problem in the first place?
- how can dump/restore or tar transfer a bad block between drives?
- have I now got a time bomb waiting to go off?

I have the second drive for 2 days. If you ask me to experiment with
something then it must be BEFORE the weekend. After that I'm back to 1
drive and I am not willing to try anything esoteric.

Thanks!

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Andrew Tridgell                 CSLab, Research School of Physical Science
tridge@aerodec.anu.edu.au       Australian National University
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-