Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!uw-beaver!rice!sun-spots-request
From: jpl@allegra.att.com
Newsgroups: comp.sys.sun
Subject: fsck -p not checking everything
Message-ID: <8902161243.AA00379@rice.edu>
Date: 28 Feb 89 12:27:14 GMT
Sender: usenet@rice.edu
Organization: Sun-Spots
Lines: 48
Approved: Sun-Spots@rice.edu
Original-Date: Thu, 16 Feb 89 07:35:44 EST
X-Sun-Spots-Digest: Volume 7, Issue 172, message 5 of 15

We ran into a similar problem with the 4.3 fsck.  The basic problem was
this: with -p, fsck waited for its parallel child fscks to complete using

	if (preen) {
		union wait status;
		while (wait(&status) != -1)
			sumstatus |= status.w_retcode;
	}

However, if a child terminated abnormally (killed by a signal), w_retcode
was 0, so the parent fsck failed to detect the error.  We changed the code
to be

	if (preen) {
		union wait status;
		while (wait(&status) != -1) {
			if (status.w_termsig) {
				printf("child died with signal %d during pass %d\n",
					status.w_termsig, passno);
				sumstatus |= 8;
			} else
				sumstatus |= status.w_retcode;
		}
	}
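
For readers on systems without union wait, the same idea can be sketched
with the POSIX status macros.  This is only an illustration, not the fsck
source: fold_status, reap_children, demo and CHILD_DIED are made-up names,
and the value 8 is simply carried over from the fix above.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical name; 8 mirrors the bit OR'ed in by the fix above. */
#define CHILD_DIED 8

/* Fold one wait() status into the running summary. */
static int fold_status(int sumstatus, int status)
{
	if (WIFSIGNALED(status)) {
		/* Killed by a signal: there is no meaningful exit code. */
		printf("child died with signal %d\n", WTERMSIG(status));
		return sumstatus | CHILD_DIED;
	}
	if (WIFEXITED(status))
		return sumstatus | WEXITSTATUS(status);
	return sumstatus;
}

/* Reap every remaining child; wait() returns -1 once none are left. */
static int reap_children(void)
{
	int status, sumstatus = 0;

	while (wait(&status) != -1)
		sumstatus = fold_status(sumstatus, status);
	return sumstatus;
}

/* Demo: one child exits with status 4, another is killed by SIGKILL. */
static int demo(void)
{
	pid_t pid;

	fflush(NULL);		/* avoid duplicating buffered output */
	if ((pid = fork()) == 0)
		_exit(4);
	if (pid == -1)
		return -1;
	if ((pid = fork()) == 0) {
		raise(SIGKILL);
		_exit(0);	/* not reached */
	}
	if (pid == -1)
		return -1;
	return reap_children();
}
```

In the demo, a loop that OR'ed in only the exit code would report 4 and
miss the killed child; with the signal check the summary also carries the
CHILD_DIED bit.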

The revised loop treats abnormal termination (MUCH more serious than a bit
of file system corruption) as an error as well.  How could a process
terminate abnormally, you might ask?  There's a line in pass1 that looks like

	ndb = howmany(dp->di_size, sblock.fs_bsize);

ndb (the number of data blocks) is subsequently used as an array index.
But we found that with a suitably huge di_size, howmany could make ndb go
negative (presumably because howmany adds the block size minus one before
dividing, and that sum can overflow), so the array reference caused a core
dump.  We cleaned that one up by adding the check

	if (ndb < 0) {
		if (debug)
			printf("bad size %d ndb %d:",
				dp->di_size, ndb);
		goto unknown;
	}
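
One way a huge size can come out negative is sketched below.  It assumes
the usual <sys/param.h> definition of howmany and 32-bit signed
arithmetic; HOWMANY, to_signed32, howmany32 and howmany_wide are names
invented for this illustration.  The wraparound is emulated with unsigned
arithmetic, since signed overflow is undefined in C.

```c
#include <stdint.h>

/* howmany() as commonly defined in <sys/param.h>: x / y, rounded up. */
#define HOWMANY(x, y)	(((x) + ((y) - 1)) / (y))

/* Reinterpret a 32-bit pattern as a two's-complement signed value. */
static int32_t to_signed32(uint32_t u)
{
	if (u <= INT32_MAX)
		return (int32_t)u;
	return (int32_t)(u - INT32_MAX - 1) + INT32_MIN;
}

/* Emulate HOWMANY in 32-bit signed arithmetic: the sum wraps first. */
static int32_t howmany32(int32_t size, int32_t bsize)
{
	uint32_t sum = (uint32_t)size + (uint32_t)(bsize - 1);

	return to_signed32(sum) / bsize;
}

/* Same computation carried out in 64 bits, where the sum cannot wrap. */
static int64_t howmany_wide(int32_t size, int32_t bsize)
{
	return HOWMANY((int64_t)size, (int64_t)bsize);
}
```

With a size of 2147483647 and a block size of 8192, howmany32 yields
-262143 where howmany_wide yields 262144.  A negative count like the first
is what the ndb < 0 check rejects.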

Until we put in these fixes, we had a file system that would make fsck
drop core, but fsck -p didn't notice it, so the condition persisted for
weeks.  We finally caught the problem when we ran a fsck without the -p,
and noticed that it died on that file system.

John P. Linderman  Department of Bounced fsck's  allegra!jpl