Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!rochester!cornell!batcomputer!rpi!usc!cs.utexas.edu!uunet!munnari.oz.au!manuel!aerodec.anu.edu.au!csc
From: tridge@aerodec.anu.edu.au (Andrew Tridgell)
Newsgroups: comp.unix.ultrix
Subject: really weird filesystem problem
Message-ID: <1005@aerodec.anu.edu.au>
Date: 1 May 91 16:48:18 GMT
Sender: news@newshost.anu.edu.au
Organization: Australian National University, Canberra, ACT, Australia
Lines: 104

I have recently struck a really weird file system problem under Ultrix
4.0 on a DS3100. The root of the problem was a "CANNOT READ: BLK
567664" reported by fsck. This prevented unattended rebooting, as every
reboot required running fsck by hand, which is a real pain. It also
meant that whenever the file at 567664 was accessed I got a kernel
panic! For example, an "ls -l" in the directory would cause a panic.

We have an rz56 disk (600Mb) with /dev/rz1a as root, /dev/rz1g as /usr
and /dev/rz1h as /u. Here is the disktab entry for an rz56 in case you
don't have one:

rz56|RZ56|DEC RZ56 Winchester:\
        :ty=winchester:ns#54:nt#15:nc#1632:\
        :pa#32768:ba#8192:fa#1024:\
        :pb#131072:bb#4096:fb#1024:\
        :pc#1299174:bc#8192:fc#1024:\
        :pd#292530:bd#8192:fd#1024:\
        :pe#292530:be#8192:fe#1024:\
        :pf#550274:bf#8192:ff#1024:\
        :pg#567666:bg#8192:fg#1024:\
        :ph#567668:bh#8192:fh#1024:
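
A quick cross-check of the numbers in this entry, as a minimal Python
sketch. It assumes that the a, b, g and h partitions sit back to back
and together make up the c partition; disktab only lists sizes, not
offsets, so that layout is an assumption. The names DISKTAB and
numeric_caps are made up for this illustration.

```python
import re

# The rz56 disktab entry, copied from the post.
DISKTAB = r"""rz56|RZ56|DEC RZ56 Winchester:\
        :ty=winchester:ns#54:nt#15:nc#1632:\
        :pa#32768:ba#8192:fa#1024:\
        :pb#131072:bb#4096:fb#1024:\
        :pc#1299174:bc#8192:fc#1024:\
        :pd#292530:bd#8192:fd#1024:\
        :pe#292530:be#8192:fe#1024:\
        :pf#550274:bf#8192:ff#1024:\
        :pg#567666:bg#8192:fg#1024:\
        :ph#567668:bh#8192:fh#1024:
"""


def numeric_caps(entry):
    """Return {name: value} for every two-letter name#number capability."""
    return {name: int(value)
            for name, value in re.findall(r"([a-z]{2})#(\d+)", entry)}


caps = numeric_caps(DISKTAB)

# Whole-drive size from the geometry: sectors/track * tracks/cyl * cylinders.
geometry_sectors = caps["ns"] * caps["nt"] * caps["nc"]

# Assumed layout: a, b, g, h laid end to end inside c.
layout_sectors = caps["pa"] + caps["pb"] + caps["pg"] + caps["ph"]

print("geometry sectors :", geometry_sectors)
print("a + b + g + h    :", layout_sectors)
print("c partition      :", caps["pc"])
print("layout matches c :", layout_sectors == caps["pc"])
```

Under that assumption a + b + g + h comes to 1299174 sectors, which is
exactly pc, and the geometry (54 x 15 x 1632) gives 1321920 sectors for
the whole drive.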


Read on for more weirdness :
	The problem started when I moved a few files from /usr to /u.
Soon afterwards I noticed that doing "ls -l" in one of the subdirectories
I had been moving files from caused a panic (I think it was a
subdirectory of /usr/include). I moved this directory to /usr/BAD and removed
all read and write permissions from it to prevent users crashing the
system. Next time we rebooted I got this:

** /dev/rz1g
** Last Mounted on /usr
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames

CANNOT READ: BLK 567664
CONTINUE? 

Continuing did no good. At this stage a check in the errorlog showed
nothing.

I tried using rzdisk to reassign the block, with no luck. rzdisk said
the block was OK and asked if I would like to continue anyway. I tried
it both ways with no change to the situation.

Next I got DEC to lend us another identical brand new disk. With both
mounted side by side I did a newfs on the new disk (all 3 partitions),
then with the new disk's g partition mounted as /nusr I did this
	
	dump 0f - /usr | (cd /nusr ; restore xf - )

to transfer everything to the new disk. (I repeated this for the a and h
partitions, on /nroot and /nu.)

It didn't work! Everything transferred OK but fsck reported the same
problem on the new g partition. Thus dump/restore had taken the problem
with it to the new disk. This was a real surprise to me. It showed the
problem was not hardware but was in fact software.

My next step was to use tar to transfer instead of dump/restore,
thinking that tar doesn't save any inode info but only transfers files
and permissions. I did this

	newfs /dev/rrz3g rz56
	fsck /dev/rz3g 			#(which reported no problems)
	(cd /usr ; tar -cf - . ) | ( cd /nusr ; tar -xf - )

Once again the problem was transferred to the new disk! An fsck on the
new disk reported the same error as above. Now I was really confused:
if it was a file system error, then how did tar transfer it? Tar only
knows about things like filenames, ownership and permissions, yet fsck
reported the same block number (567664).

Now I got really desperate. I ran newfs with the -v option on
/dev/rrz3g so I could see how it was calling mkfs, then I ran mkfs
manually with 6 fewer sectors: I changed the second parameter in the
call to mkfs from 567666 to 567660. The bad block is thus excluded from
the file system. I then used tar as above to try yet again to transfer
the stuff. Success! fsck reports no problems on the new g partition!
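
The arithmetic behind that workaround can be sketched in a few lines of
Python. This assumes that the BLK number fsck prints is a 512-byte
sector offset within the partition and that the file system is
allocated in 1024-byte fragments (fg#1024 in the disktab entry); both
are assumptions, and the function names are made up for the example. It
only shows where block 567664 sits, not why it cannot be read.

```python
SECTOR_BYTES = 512    # assumption: fsck's BLK numbers count 512-byte sectors
FRAG_BYTES = 1024     # fg#1024 from the disktab entry

BAD_BLK = 567664      # block number reported by fsck
OLD_SIZE = 567666     # sectors in g according to the partition table
NEW_SIZE = 567660     # sectors passed to mkfs by hand


def last_fragment_start(size_sectors):
    """Sector number at which the last whole fragment begins."""
    sectors_per_frag = FRAG_BYTES // SECTOR_BYTES
    whole_frags = size_sectors // sectors_per_frag
    return (whole_frags - 1) * sectors_per_frag


def in_filesystem(size_sectors, blk):
    """True if sector blk lies inside a file system of the given size."""
    return 0 <= blk < size_sectors


print("last fragment, old size:", last_fragment_start(OLD_SIZE))
print("last fragment, new size:", last_fragment_start(NEW_SIZE))
print("bad block inside old fs:", in_filesystem(OLD_SIZE, BAD_BLK))
print("bad block inside new fs:", in_filesystem(NEW_SIZE, BAD_BLK))
```

Under those assumptions, 567664 is where the very last fragment of a
567666-sector file system starts, and a 567660-sector file system ends
before it.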

BUT (there has to be a but)
I now have an inconsistency between the partition table and the file
system. The partition table thinks there are 567666 sectors, the file
system thinks there are 567660. Could this cause a problem in the future?
I asked DEC support and they don't think so, but maybe......

Basically my questions to the net are :
	
	- what caused the problem in the first place?
	- how can dump/restore or tar transfer a bad block between drives?
	- have I now got a time bomb waiting to go off?

I have the second drive for 2 days. If you ask me to experiment with
something then it must be BEFORE the weekend. After that I'm back to 1
drive and I am not willing to try anything esoteric.

Thanks!

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Andrew Tridgell                 CSLab, Research School of Physical Science
tridge@aerodec.anu.edu.au       Australian National University
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-