Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!shelby!agate!dog.ee.lbl.gov!elf.ee.lbl.gov!torek
From: torek@elf.ee.lbl.gov (Chris Torek)
Newsgroups: comp.unix.admin
Subject: Re: Problem with dump
Message-ID: <10036@dog.ee.lbl.gov>
Date: 19 Feb 91 11:01:28 GMT
References: <529@mesrx.UUCP> <2528@autodesk.COM> <18621@cbmvax.commodore.com>
Reply-To: torek@elf.ee.lbl.gov (Chris Torek)
Organization: Lawrence Berkeley Laboratory, Berkeley
Lines: 64
X-Local-Date: Tue, 19 Feb 91 03:01:28 PST

>>In article <529@mesrx.UUCP> bbraden@mesrx.UUCP (Bill Braden) writes:
>>>  DUMP: (This should not happen)bread from /dev/rra0c [block 69872]:
>>>count=8192, got=-1

>In article <2528@autodesk.COM> stevel@Autodesk.COM (Steve Litras) writes:
>>... According to our Sun engineer, it is a fairly harmless problem
>>(soft error) ....

In article <18621@cbmvax.commodore.com> grr@cbmvax.commodore.com
(George Robbins) writes:
>Well, there's no way it's a "soft" error,

Depending on definitions and exact circumstances, it could be; but read on:

>the data that dump was trying to read isn't [read], and junk is written
>on the tape.  In some cases, it may be that the data is "don't care"...

>There seem to be several major causes for the problem.
>One is an actual read error [on the disk drive] ...

This is probably the most common cause.  Since many read errors can be
recovered simply by persistence, you may be able to get a good copy of
the file or block in question.  The drive should be repaired and/or the
bad sector forwarded.
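
For illustration only, here is a minimal Python sketch of what "persistence"
could look like: re-issue the same read a few times and keep the first
complete result.  The function name, the attempt count, and the use of
os.pread are my assumptions; none of this comes from dump itself.

```python
import os


def read_with_retries(fd, offset, count, attempts=5):
    """Try to read `count` bytes at byte `offset`, retrying on failure.

    Returns the data from the first complete read, or None if every
    attempt fails or comes up short.  (Hypothetical helper.)
    """
    for _ in range(attempts):
        try:
            data = os.pread(fd, count, offset)
        except OSError:
            continue  # possibly a transient read error: try again
        if len(data) == count:
            return data
    return None
```

The caller is expected to open the raw device and convert the block number
from the error message into a byte offset; the units of that conversion
depend on the system and are not shown here.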

To find the name of the file, use icheck -b <block number> followed by
ncheck -i <inode number>.  The <block number> you need for icheck is the
number shown in square brackets (here 69872).  See icheck(8) and ncheck(8)
for details.

Note that the block may be the final block of a large file (one that no
longer ends in a fragment), in which case the data that dump could not
read may be irrelevant.  You should still fix the problem before the
file is extended in place.
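
As a rough sketch of the arithmetic, assuming an 8192-byte file system
block (taken from the count=8192 in the error message) and a hypothetical
helper name, the part of the final block that actually belongs to the
file is:

```python
BLOCK_SIZE = 8192  # assumed file system block size, in bytes


def bytes_used_in_last_block(file_size, block_size=BLOCK_SIZE):
    """Return how many bytes of the file's final block hold file data."""
    if file_size == 0:
        return 0
    remainder = file_size % block_size
    # A remainder of zero means the final block is completely full.
    return remainder if remainder else block_size
```

Anything in the block beyond that count lies past the end of the file, so
a failure confined to that region would not lose file data.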

>Another is when a filesystem has been corrupted such that there are
>pointers outside the partition in the structure.  Running fsck should
>find/"fix" this sort of problem.

Correct, unless the reason the file system appeared to be damaged was
a synchronization error caused by dumping a `live' file system (one that
is actively being modified).  In this case a second dump will not show
the error (hence it can be called `soft').

>Another cause that has been mentioned from time to time is when a filesystem
>completely fills a partition and the partition size isn't a multiple of some
>magic number of blocks (I forget the exact excuse).  In this case when the
>partition fills, dump tries to do a multi-block read of the last chunk of
>data and fails because the multi-block region crosses the partition boundary.
>If the block in error corresponds to one of the last blocks in the partition,
>this might be your problem.

Dump is (and has been since 4.2BSD; the comment is signed `mkm 9/25/83')
smart enough to `back off' after such an error; it will not complain
about `bread from %s [block %d]...'.  An over-eager kernel hacker might
have a driver that logs the attempt to read past the end of the partition,
but dump itself will recover.
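
The general shape of such a back-off can be sketched in Python.  This is
not dump's source: the 512-byte device block, the fs_blocks parameter, the
zero padding of the unread tail, and the use of os.pread are all
assumptions made for the example.

```python
import os

DEV_BSIZE = 512  # assumed device block size, in bytes


def bread(fd, blkno, size, fs_blocks):
    """Read `size` bytes starting at device block `blkno`.

    `fs_blocks` is the file system size in device blocks.  If a read
    fails or is short and the request runs past the end of the file
    system, shrink the request by one device block and try again.
    The unread tail is zero-filled (an assumption of this sketch).
    """
    want = size
    got = -1
    while want > 0:
        try:
            data = os.pread(fd, want, blkno * DEV_BSIZE)
            got = len(data)
        except OSError:
            data, got = b"", -1
        if got == want:
            return data + bytes(size - want)
        if blkno + want // DEV_BSIZE > fs_blocks:
            want -= DEV_BSIZE  # request crosses the end: back off
            continue
        break  # failure inside the file system: a real error
    raise OSError(f"bread [block {blkno}]: count={size}, got={got}")
```

Only a failure that lies wholly inside the file system reaches the final
raise, which is the case the `This should not happen' message reports.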

Along the same lines, fsck can fail on a block file system under 4.2BSD
if a partition size is not a multiple of 2048 (BLKDEV_IOSIZE) bytes.
This was fixed by the time of the 4.3BSD-tahoe distribution.
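
To check whether a given partition is affected, the test is a simple
remainder.  A small Python sketch follows; the helper name is made up,
and the 512-byte sector size in the usage note is an assumption.

```python
BLKDEV_IOSIZE = 2048  # bytes, the value quoted above


def trailing_bytes(partition_bytes, iosize=BLKDEV_IOSIZE):
    """Return the bytes left over after the last full iosize chunk.

    Zero means the partition size is an exact multiple of iosize.
    """
    return partition_bytes % iosize
```

For example, with 512-byte sectors, a 1000-sector partition gives
trailing_bytes(1000 * 512) == 0, while a 1001-sector partition gives
trailing_bytes(1001 * 512) == 512 and so is not a multiple of 2048.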
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov