Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!aramis.rutgers.edu!athos.rutgers.edu!rbthomas
From: rbthomas@athos.rutgers.edu (Rick Thomas)
Newsgroups: comp.os.minix
Subject: Re: 1.3 hard disk question
Message-ID:
Date: 21 Jun 89 22:30:41 GMT
References: <19106@cup.portal.com>
Organization: Rutgers Univ., New Brunswick, N.J.
Lines: 109

I finally solved the problem. It turned out to be caused by having a
hard disk with 5 heads instead of 4, as fdisk and buddies expected.

The above is the executive summary, now for the gory details.

First a summary of the problem:

I have two hard disks in my AT. One (the first one -- 20 MB) I have
reserved for DOS. (DOS refuses to boot unless there is an active
partition on the first hard disk, so I gave it the whole thing.) The
second disk (30 MB) is for Minix. In Minix it is called /dev/hd5 and
the partitions are /dev/hd{6,7,8,9}.

While booting -- during the copy root to ramdisk phase -- I get the
diagnostic "Invalid partition table" (always at block 45 for some
reason) and partitions 6, 7, 8 and 9 all have length zero according to
"dd if=/dev/hdN of=/dev/null" (where N is one of 6, 7, 8, 9). dd claims
to have copied "0+0" records. "fdisk /dev/hd5" shows a perfectly normal
partition table with partition 1 equal to the whole disk. mkfs /dev/hd6
fails with a write error, so I have put my /usr partition on /dev/hd5,
and all seems well except for the "Invalid partition table" message and
the fact that the boot-time fsck refuses to check /dev/hd5 (it won't
even give me the option! It only accepts 1-4 and 5-9.) Run-time fsck
works fine.

Now an exciting tale of tracking the wild bug monster in its native
habitat:

My initial posting of the problem brought back two responses, one from
Andy Tanenbaum and the other from Bruce Evans. Andy had no immediate
help (He noted that his home machine has two disks and it works just
fine!)
but Bruce pointed out that doing mkfs on /dev/hd5 is a "no-no" because
it writes zero bytes all over the boot block (which includes the
partition table). I had already discovered this myself, and restored (I
thought) the partition table using the Minix fdisk program. Certainly
fdisk displayed a partition table that looked OK to me.

My first step was to use grep to go looking through the whole source
tree for the place that was producing the "Invalid partition" message.
I found it in kernel/at_wini.c. The actual message was somewhat longer
than that and included the number of the disk that it judged bad, but
the message was getting truncated by the screen-will-not-wrap
(mis-)feature of vanilla Minix 1.3 out of the box. I edited the message
to put a newline at the beginning and end, so I could see which disk
was in trouble (hoping to see that it was as simple as that Minix was
not happy with the DOS partitioning of my first disk -- i.e. something
I could ignore) and built a new kernel and boot disk. Unfortunately, it
turned out to be the Minix disk that was causing the trouble.

So I hacked a little more on at_wini.c, and made it print out the
in-core partition table that it built from what it read off the boot
sector. (Putting kernel printf's into system initialization code is a
scary business. If you happen to rip open a timing window and print to
the console before its driver gets initialized, you can do all kinds of
bad stuff! I was working without a net, too, in the sense that if I
clobbered my disk, I would have to go back to square one and spend
three or four evenings reloading from the distribution disks and
re-applying patches.) Andy was the one who suggested that putting
printf's into the driver is a useful debugging technique when nothing
else seems to help. Thanks, Andy!

The upshot was that the partition table seemed to be OK on disk, but
was all zeros in core. So now it's time to read the code. (-8 When all
else fails, read the instructions...
8-)

A line-by-line examination of the code revealed that there is a "magic
number" (0xAA55) in the last two bytes of the boot block that at_wini.c
wants to see, and if it doesn't see it, it prints the "Invalid
partition" message and zeros out the in-core partition table. The
zeroed partition table was why "dd if=/dev/hd6" found nothing to read.

I hacked together a version of fdisk that put back the magic number and
ran it. Lo and behold, when I rebooted, the nasty message had gone away
and I could read from /dev/hd6. But the in-core partition table was a
mess. (The on-disk version still looked OK when I printed it out with
fdisk. The in-core version had non-zero numbers in it but they were
wrong.)

Time to read the code again. I had recently retrieved a public domain
disk partitioning program from the Clarkson University software
archives, and reading it gave me a very good description of how the
on-disk partition table was supposed to look. (If anyone is interested,
I can post the relevant parts of the code and comments. The formats are
definitely non-obvious!) The program itself was not much help, because
it was tightly coupled with an odd-ball disk driver TSR that I did not
want to try out just then. (I had enough troubles!)

In the light of that new knowledge, I looked at fdisk with a critical
eye, and found that it had hard-coded assumptions about the number of
sectors and heads on a disk. Those assumptions were wrong for my second
(Minix) disk, though they were right for my first (DOS) disk. I hacked
on fdisk until I had a version that hard-coded the right assumptions
for my disk (which was *much* easier than doing the job right by making
it find out from the BIOS what the correct assumptions were!) I ran it,
and everything is beautiful again.

Now "all" I have to do is back up my /dev/hd5 filesystem, run mkfs on
/dev/hd6, and reload. Shouldn't take me more than a week.

Anybody want to hack fdisk to do the job *right* once and for all?
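
For anyone who wants a starting point, here is a minimal sketch in C of
the two things that bit me: the 0xAA55 magic number in the last two
bytes of the boot block, and the way a hard-coded head count changes
where a sector number lands on the disk. This is not the code from
fdisk or at_wini.c. The function and struct names are made up for
illustration, and the 17 sectors per track used in the note below is an
assumption about the drive, not something taken from the Minix sources.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512
#define SIG_OFFSET  510      /* last two bytes of the boot block */
#define BOOT_SIG    0xAA55u  /* stored low byte first: 0x55, then 0xAA */

/* Hypothetical helper: return 1 if the boot block ends in 0xAA55. */
int has_boot_signature(const uint8_t sector[SECTOR_SIZE])
{
    unsigned sig = sector[SIG_OFFSET] |
                   ((unsigned)sector[SIG_OFFSET + 1] << 8);
    return sig == BOOT_SIG;
}

/* Hypothetical helper: put the magic number back into a boot block. */
void put_boot_signature(uint8_t sector[SECTOR_SIZE])
{
    sector[SIG_OFFSET]     = 0x55;
    sector[SIG_OFFSET + 1] = 0xAA;
}

/* Cylinder/head/sector address; sectors count from 1, the rest from 0. */
struct chs {
    unsigned cyl;
    unsigned head;
    unsigned sec;
};

/*
 * Convert an absolute sector number to cylinder/head/sector for a given
 * geometry. The geometry is passed in rather than hard-coded, which is
 * the point: the same sector number maps to different places depending
 * on how many heads you believe the disk has.
 */
struct chs lba_to_chs(unsigned long lba, unsigned heads,
                      unsigned sectors_per_track)
{
    struct chs r;

    assert(heads > 0 && sectors_per_track > 0);
    r.sec  = (unsigned)(lba % sectors_per_track) + 1;
    r.head = (unsigned)((lba / sectors_per_track) % heads);
    r.cyl  = (unsigned)(lba / ((unsigned long)sectors_per_track * heads));
    return r;
}
```

Assuming 17 sectors per track, lba_to_chs(100, 4, 17) gives cylinder 1,
head 1, sector 16, while lba_to_chs(100, 5, 17) gives cylinder 1,
head 0, sector 16. A partition table written with the wrong head count
is therefore internally consistent but describes the wrong places on
the disk. A proper fix would get the heads and sectors from the BIOS or
the controller instead of passing constants.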
--
Rick Thomas
uucp: {ames, cbosgd, harvard, moss, seismo}!rutgers!jove.rutgers.edu!rbthomas
arpa: rbthomas@JOVE.RUTGERS.EDU
Phone: (201) 932-4301