Path: utzoo!attcan!uunet!nuchat!moray!siswat!buck
From: buck@siswat.UUCP (A. Lester Buck)
Newsgroups: comp.unix.microport
Subject: bad block bug on second disk in V/AT 2.4.0L
Keywords: bad block V/AT 2.4 workaround
Message-ID: <433@siswat.UUCP>
Date: 2 Aug 89 02:57:22 GMT
Organization: Photon Graphics, Houston
Lines: 110

Some time ago I asked the net about a weird problem I was having with
the following setup:

    Microport System V/AT 2.4.0L
    Adaptec 2372 disk controller
    Seagate 277R     - disk 0
    Mitsubishi 535   - disk 1

The Mitsubishi 535 had a few bad spots. They were found by the
Microport surface scan and they showed up on "showbad 1", but they
were always ignored when using a filesystem on the disk, generating
irritating console error messages and randomly losing data. Unit 0
works perfectly, correctly mapping out its bad sectors.

Only Bill Vajk (learn@igloo) responded, to tell me he had run across
the same problem and was discouraged enough to consider moving back
to 2.3, since he also needed two drives to handle news.

I finally figured out how to work around this problem. The "obvious"
solution was to make a file and use fsdb to put all the bad sectors in
that one file. Then they are off the free list and never heard from
again (grep -v badfile on backups, etc.). Unfortunately, this just
wasn't working and I was still getting the "Bad block" messages on the
console on a regular basis.

Here is the full workaround. The Microport 2.4 disk driver apparently
does full track caching on any access to a track, so the message
appears for the one bad sector if any of that track's sectors are
read. Also, once that track is in the track cache, further reads from
that track might NOT give the console message if they do not access
the one bad sector. So if you write a simple program to find when the
console message shows up, to really test it you need to do the
following for each of the sectors on the suspect track: seek away from
the suspect sector, read anything to flush the track cache, then seek
back and read the sector.
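The seek-away-and-back test described above might look roughly like
the sketch below. This is not the program I used; the names
probe_track, read_sector and far_sector are made up for illustration,
it is written in present-day POSIX C rather than for the V/AT
compiler, and it assumes that one read from some other track is enough
to push the suspect track out of the driver's track cache. The
512-byte sector size follows from 26 sectors making 13 1K blocks per
track. The return value only counts failed or short reads; the real
evidence is still the "Bad block" message on the console, so watch
that while it runs.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512L  /* 26 sectors/track == 13 1K blocks/track */

/* Seek to an absolute sector on the open device and read it.
 * Returns the number of bytes read, or -1 on seek/read error. */
static long read_sector(int fd, long sector, char *buf)
{
    if (lseek(fd, (off_t)sector * SECTOR_SIZE, SEEK_SET) == (off_t)-1)
        return -1;
    return (long)read(fd, buf, (size_t)SECTOR_SIZE);
}

/* Probe nsectors sectors starting at first_sector (one track's worth).
 * Before each probe, read far_sector, which the caller picks on some
 * other track, so the suspect track is not served from the cache
 * (assumption: one unrelated read flushes it).
 * Returns how many probes came back failed or short. */
int probe_track(int fd, long first_sector, long nsectors, long far_sector)
{
    char buf[SECTOR_SIZE];
    int failures = 0;
    long s, n;

    assert(nsectors > 0);
    for (s = first_sector; s < first_sector + nsectors; s++) {
        (void)read_sector(fd, far_sector, buf);  /* seek away, read anything */
        n = read_sector(fd, s, buf);             /* seek back, read suspect */
        printf("sector %ld: read returned %ld\n", s, n);
        if (n != SECTOR_SIZE)
            failures++;
    }
    return failures;
}
```

To use it, open the raw device read-only (for my disk that would be
/dev/rdsk/1s2) and call probe_track(fd, first, 26, far) with the
first sector of the suspect track and any sector on a distant track.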
You should find that every sector on a track with a bad sector gives a
console message, even though most of the time the data is read
correctly. All of this assumes the bad block sniffer program uses the
raw device to bypass the buffer cache and the bad block table (which
should be active for the block device, but of course isn't for
Unit 1).

I finally used the program at the end of this posting to make a single
indirect block with the list of blocks on the tracks with the defects.
My disk is a Mitsubishi 535 RLL with 977 tracks, 5 heads, 26
sectors/track, and /dev/rdsk/1s2 is a partition starting at cylinder
504. The procedure is:

  1. mount a newly mkfs'ed filesystem
  2. touch .badtracks in the root directory
  3. umount the filesystem
  4. start fsdb
  5. find the .badtracks inode with 2i.fd
  6. change a10=2000 (arbitrary)
  7. change sz=1024*blocks I am locking out
  8. run the program below (writebad.c)
  9. run fsck to clean up the free list

Here, a block is a logical block == 1 K, so there are 13 blocks/track
on an RLL disk. Fsck should complain about (number of blocks locked
out from bad tracks) + (1 block from indirect block 2000) as
duplicated on the free list and remove them. Then the next fsck run
will be clean and the disk is now perfectly normal. The file
.badtracks should have all permissions missing.

I have not gotten a console message about bad blocks in the couple of
months since I fixed my filesystem this way.

If you trust your calculations, you can just adjust this program to
map out your own defects. Otherwise, you can write a simple bad block
sniffer program for the raw device, remembering to seek away after
reading every sector.

I have no idea whether this problem needs an RLL disk or whatever to
show up, but it might explain some of the anomalous behavior that has
been recently reported in this newsgroup with two ST-4096 drives.

A. Lester Buck		...!texbell!moray!siswat!buck

/* write indirect block with list of bad blocks */
/* this for a partition that starts at cyl 504 */

#include <stdio.h>
#include <fcntl.h>

extern long lseek();	/* returns long; undeclared it would default to int */

#define DIRBLOCK	(2000L)	/* arbitrary choice for indirect block */

/* one 1K indirect block; each group of 13 entries is one bad track */
unsigned long badblocks[1024/sizeof(long)] = {
	2743, 2744, 2745, 2746, 2747, 2748, 2749,
	2750, 2751, 2752, 2753, 2754, 2755,

	8489, 8490, 8491, 8492, 8493, 8494, 8495,
	8496, 8497, 8498, 8499, 8500, 8501,

	/* for example, the following defect is at cyl 806, head 3 */
	19669, 19670, 19671, 19672, 19673, 19674, 19675,
	19676, 19677, 19678, 19679, 19680, 19681,

#if 0	/* my partition was auto-resized to end before this defect */
	30004, 30005, 30006, 30007, 30008, 30009, 30010,
	30011, 30012, 30013, 30014, 30015, 30016
#endif
};

main()
{
	int fd, nwrite;

	/* raw device, so the write bypasses the buffer cache */
	if ((fd = open("/dev/rdsk/1s2", O_WRONLY)) == -1) {
		perror("open /dev/rdsk/1s2 failed");
		exit(1);
	}

	/* position at the block that a10 in the .badtracks inode points to */
	if (lseek(fd, DIRBLOCK*1024L, 0) == -1L) {
		perror("seek to block failed");
		exit(1);
	}

	if ((nwrite = write(fd, badblocks, sizeof(badblocks))) == -1) {
		perror("write badblocks failed");
		exit(1);
	}

	printf("wrote %d bytes\n", nwrite);
	return 0;
}

-- 
A. Lester Buck		...!texbell!moray!siswat!buck
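For anyone adjusting the block list to their own defects, the numbers
in badblocks[] can be reproduced from cylinder and head. The sketch
below is an illustration only: the function names are made up, and the
formula is inferred from the posted numbers rather than taken from any
Microport documentation. It assumes the filesystem's block 0 sits at
head 0 of the partition's first cylinder (504 here), with 5 heads and
13 1K blocks per track. With those assumptions the defect at cyl 806,
head 3 comes out at block 19669, which matches the list above.

```c
#include <assert.h>
#include <stdio.h>

#define PART_START_CYL   504L  /* first cylinder of /dev/rdsk/1s2 */
#define HEADS            5L
#define BLOCKS_PER_TRACK 13L   /* 26 sectors of 512 bytes == 13 1K blocks */

/* First 1K logical block, relative to the start of the partition,
 * of the track at the given cylinder and head. */
long track_first_block(long cyl, long head)
{
    assert(cyl >= PART_START_CYL && head >= 0 && head < HEADS);
    return ((cyl - PART_START_CYL) * HEADS + head) * BLOCKS_PER_TRACK;
}

/* Append every block of one track to list[], starting at index count.
 * The caller must leave room for BLOCKS_PER_TRACK more entries.
 * Returns the new entry count. */
int add_track(long *list, int count, long cyl, long head)
{
    long first = track_first_block(cyl, head);
    long i;

    for (i = 0; i < BLOCKS_PER_TRACK; i++)
        list[count++] = first + i;
    return count;
}
```

Calling add_track() once per bad track builds the same kind of list
as the hand-typed initializer. The count it returns is the "blocks I
am locking out" figure that goes into sz=1024*blocks in fsdb.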