Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!uw-beaver!rice!sun-spots-request
From: jpl@allegra.att.com
Newsgroups: comp.sys.sun
Subject: fsck -p not checking everything
Message-ID: <8902161243.AA00379@rice.edu>
Date: 28 Feb 89 12:27:14 GMT
Sender: usenet@rice.edu
Organization: Sun-Spots
Lines: 48
Approved: Sun-Spots@rice.edu
Original-Date: Thu, 16 Feb 89 07:35:44 EST
X-Sun-Spots-Digest: Volume 7, Issue 172, message 5 of 15

We ran into a similar problem with the 4.3 fsck.  The basic problem was
this.  fsck waited for parallel fscks to complete using

    if (preen) {
        union wait status;
        while (wait(&status) != -1)
            sumstatus |= status.w_retcode;
    }

However, if the process terminated abnormally, retcode was 0, so fsck
failed to detect the error.  We changed the code to be

    if (preen) {
        union wait status;
        while (wait(&status) != -1) {
            if (status.w_termsig) {
                printf("child died with signal %d during pass %d\n",
                    status.w_termsig, passno);
                sumstatus |= 8;
            } else
                sumstatus |= status.w_retcode;
        }
    }

This treats abnormal termination (MUCH more serious than a bit of file
system corruption) as an error as well.

How could a process terminate abnormally, you might ask?  There's a line
in pass1 that looks like

    ndb = howmany(dp->di_size, sblock.fs_bsize);

ndb (the number of data blocks) is subsequently used as an array index.
But we found that with suitably huge di_size, howmany could make ndb go
negative, so the array reference caused a dump.  We cleaned that one up by
adding the check...

    if (ndb < 0) {
        if (debug)
            printf("bad size %d ndb %d:", dp->di_size, ndb);
        goto unknown;
    }

Until we put in these fixes, we had a file system that would make fsck
drop core, but fsck -p didn't notice it, so the condition persisted for
weeks.  We finally caught the problem when we ran a fsck without the -p,
and noticed that it died on that file system.

John P. Linderman  Department of Bounced fsck's  allegra!jpl
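
For anyone reading this on a system without the 4.3BSD "union wait", the wait() fix described above can be sketched with the POSIX status macros from <sys/wait.h>. This is only an illustration, not the actual fsck source: fold_status(), reap_children(), spawn_child() and the CHILD_DIED constant are made-up names, with CHILD_DIED chosen to mirror the "sumstatus |= 8" in the post.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed value, mirroring the "sumstatus |= 8" used in the post. */
#define CHILD_DIED 8

/* Fold one wait() status into the running summary.  A child killed by
 * a signal carries no meaningful exit code, so it is flagged explicitly
 * instead of being OR-ed in as if it had exited with 0. */
int fold_status(int sumstatus, int status)
{
    if (WIFSIGNALED(status)) {
        fprintf(stderr, "child died with signal %d\n", WTERMSIG(status));
        return sumstatus | CHILD_DIED;
    }
    if (WIFEXITED(status))
        return sumstatus | WEXITSTATUS(status);
    return sumstatus;
}

/* Reap every outstanding child and return the combined status. */
int reap_children(void)
{
    int status;
    int sumstatus = 0;

    while (wait(&status) != -1)
        sumstatus = fold_status(sumstatus, status);
    return sumstatus;
}

/* Hypothetical test helper: fork a child that kills itself with `sig`
 * when sig is nonzero, and otherwise exits with `code`. */
pid_t spawn_child(int code, int sig)
{
    pid_t pid = fork();

    if (pid == 0) {
        if (sig != 0)
            raise(sig);
        _exit(code);
    }
    return pid;
}
```

With these helpers, calling spawn_child(0, SIGKILL) followed by reap_children() yields a nonzero summary, whereas a loop that only OR-ed in the exit code would report 0 for the same child.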