Path: utzoo!attcan!uunet!nuchat!moray!siswat!buck
From: buck@siswat.UUCP (A. Lester Buck)
Newsgroups: comp.unix.microport
Subject: bad block bug on second disk in V/AT 2.4.0L
Keywords: bad block V/AT 2.4 workaround
Message-ID: <433@siswat.UUCP>
Date: 2 Aug 89 02:57:22 GMT
Organization: Photon Graphics,  Houston
Lines: 110

Some time ago I asked the net about a weird problem I was having with
the following setup:

    Microport System V/AT 2.4.0L
    Adaptec 2372 disk controller
    Seagate 277R - disk 0
    Mitsubishi 535 - disk 1

The Mitsubishi 535 had a few bad spots.  The Microport surface scan found
them and they showed up in "showbad 1", but they were always ignored when
a filesystem on the disk was in use, generating irritating console error
messages and randomly losing data.  Unit 0 works perfectly, correctly
mapping out its bad sectors.  Only Bill Vajk (learn@igloo) responded to tell
me he had run across the same problem and was discouraged enough to consider
moving back to 2.3, since he also needed two drives to handle news.

I finally figured out how to work around this problem.  The "obvious"
solution was to make a file and use fsdb to put all the bad sectors
in one file.  Then they are off the free list and never heard from
again (grep -v badfile on backups, etc.).  Unfortunately, this just
wasn't working and I was still getting the "Bad block" messages on
the console on a regular basis.  Here is the full workaround.

The Microport 2.4 disk driver apparently does full track caching on any
access to a track, so the message appears for the one bad sector if
any of that track's sectors are read.  Also, once that track is in
the track cache, further reads from that track might NOT give the
console message if they do not access the one bad sector.  So if you
write a simple program to find when the console message shows up,
to really test it you need to seek away from the suspect sector,
read anything to flush the track cache, then seek back, for each of
the sectors on the suspect track.  You should find that every sector on
a track containing a bad sector gives a console message, even though most
of the time the data is read correctly.  All of this assumes the
bad block sniffer program uses the raw device to bypass the buffer
cache and the bad block table (which should be active for the block
device, but of course isn't for Unit 1).

I finally used the program at the end of this posting to make a single indirect
block with the list of blocks on the track with the defects.  My disk is a
Mitsubishi 535 RLL with 977 tracks, 5 heads, 26 sectors/track, and
/dev/rdsk/1s2 is a partition starting at cylinder 504.  Before running it,
I do the following:

    1. mount a newly mkfs'ed filesystem
    2. touch .badtracks in the root directory
    3. umount the filesystem
    4. start fsdb and find the .badtracks inode with 2i.fd
    5. change a10=2000 (an arbitrary block for the indirect block)
    6. change sz=1024*(number of blocks I am locking out)

Then I run the program below (writebad.c), and finally run fsck to clean up
the free list.
Here, a block is a logical block == 1 K, so there are 13 blocks/track on an
RLL disk.  Fsck should complain about (number of blocks locked out from bad
tracks) + (1 block from indirect block 2000) as duplicated on free list and
remove them.  Then the next fsck run will be clean and the disk is now
perfectly normal.  The file .badtracks should have all permissions missing.
I have not gotten a console message about bad blocks in the couple of months
since I fixed my filesystem this way.

If you trust your calculations, you can just adjust this program to map out
your own defects.  Otherwise, you can write a simple bad block sniffer
program for the raw device, remembering to seek away after reading every
sector.

I have no idea whether this problem needs an RLL disk or whatever to show
up, but it might explain some of the anomalous behavior that has been recently
reported in this newsgroup with two ST-4096 drives.

A. Lester Buck		...!texbell!moray!siswat!buck


/* write indirect block with list of bad blocks */
/* this is for a partition that starts at cyl 504 */

#include <stdio.h>	/* perror, printf */
#include <fcntl.h>

#define	DIRBLOCK	(2000L)	/* arbitrary choice for indirect block */

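/* one 1K indirect block; long is 4 bytes on the 286, so 256 entries, */
/* and the entries not listed below stay zero */
unsigned long	badblocks[1024/sizeof(long)] =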
		{
		2743, 2744, 2745, 2746, 2747, 2748, 2749, 2750,
		2751, 2752, 2753, 2754, 2755,
		8489, 8490, 8491, 8492, 8493, 8494, 8495, 8496,
		8497, 8498, 8499, 8500, 8501,
		/* for example, the following defect is at cyl 806, head 3 */
		19669, 19670, 19671, 19672, 19673, 19674, 19675, 19676,
		19677, 19678, 19679, 19680, 19681,
#if 0	/* my partition was auto-resized to end before this defect */
		30004, 30005, 30006, 30007, 30008, 30009, 30010, 30011,
		30012, 30013, 30014, 30015, 30016
#endif
		};

main()
{
    int   fd, nwrite;

    if ((fd = open("/dev/rdsk/1s2", O_WRONLY)) == -1) {
	perror("open /dev/rdsk/1s2 failed");
	exit(1);
    }
    if (lseek(fd, DIRBLOCK*1024L, 0) == -1) {
	perror("seek to block failed");
	exit(1);
    }
    if ((nwrite = write(fd, badblocks, sizeof(badblocks))) == -1) {
	perror("write badblocks failed");
	exit(1);
    }
    printf("wrote %d bytes\n", nwrite);
    exit(0);	/* give the shell a defined exit status */
}

-- 
A. Lester Buck		...!texbell!moray!siswat!buck