Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!eecae!netnews.upenn.edu!rutgers!ucsd!ucsdhub!hp-sdd!ncr-sd!ncrlnk!uunet!convex!iex!ntvax!canoaf
From: canoaf@ntvax.UUCP (Augustine Cano)
Newsgroups: comp.sys.att
Subject: Summary: Hard disk errors on a 3b1
Keywords: HDERR 3b1 disk errors Seagate
Message-ID: <388@ntvax.UUCP>
Date: 10 Feb 89 21:41:17 GMT
Distribution: usa
Organization: University of North Texas
Lines: 72

Hello netland:

I got 2 responses to my posting about intermittent HD errors, with
crashes, on my 3b1 with a Seagate 4096. This is the second such drive;
the first one, still under warranty, overflowed the bad block table.
Days can pass without any problem and then suddenly I'll get half a
dozen errors in a row. The actual error (with crash) is always as
follows:

Drive 0, cmd 0
#HDERR ST:51 ...   (repeated usually 3 times)
panic: Hard disk timeout
Please record panic message.
Press hardware reset to reboot.

where `#HDERR ST:51 ...' has been at different times:

#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4208 SC:4204 SDH:4224 DMACNT:FFFF
       DCRREG:94 MCRREG:8900
#HDERR ST:51 EF:10 CL:4280 CH:4202 SN:420E SC:4202 SDH:4222 DMACNT:FFFF
       DCRREG:92 MCRREG:8B00
#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:420A SC:4202 SDH:4226 DMACNT:FFFF
       DCRREG:96 MCRREG:8100
#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4200 SC:4204 SDH:4225 DMACNT:FFFF
       DCRREG:B5 MCRREG:8B00
#HDERR ST:51 EF:10 CL:4283 CH:4203 SN:4204 SC:4202 SDH:4224 DMACNT:FFFF
       DCRREG:94 MCRREG:8100

From my original posting:

> Is it possible that the disk really does not have a bad spot but that a
> combination of factors triggers a software bug in the kernel or driver?

Christopher J. Calabrese from AT&T Bell Laboratories, Murray Hill, NJ
said:

> I've never run across such problems with the disk drivers;
> however, it could be a bad disk controller chip, or a bad ribbon cable.
> I've seen that around here before.

Many, many thanks go also to Brant Cheikes, who sent a long and detailed
account of exactly the same problem I'm having.
On his advice, I called Ben Wollberg (415)678-1353 (8a-5p PST), who
fixed his machine. Ben told me that the first thing he would do would
be to back up the whole disk and reformat it. He said that this might
make the problem disappear. If it didn't, I would probably have to have
the disk repaired. Apparently the test they run to find and map the bad
sectors takes 7 hours. I wonder if I can do something similar with the
test disk. Can anybody out there tell me what I need to tell the test
program to do an exhaustive format-write-read check that would detect
all intermittent errors?

In any case, a summary of Brant's response follows:

> The problems all showed up as HDERR's logged to /usr/adm/unix.log.
> The errors would come in groups of three or four, and would always be
> accompanied by a mechanical whine from the disk. I believe that noise
> indicates that the drive is "recalibrating," retracting the heads and
> resetting itself in some way. The errors were highly intermittent; I
> could go several days without an error, then suddenly get several in
> one day. Weather did not seem to be a factor, nor did temperature. I
> checked the power output from my power supply, and found no variation
> even while the drive was running the random seek diagnostic test.
>
> Occasionally, the errors would cause recoverable disk errors. Things
> like missing blocks in the free list, things that fsck could fix. No
> data was ever lost, to my knowledge, but it really sucked having to
> fsck the disk every few days.
>
> Then, the machine started crashing.
> The accompanying whine in these cases lasted several seconds, and the
> system was hung while it was going on. Then boom, the panic and a
> reset was necessary.

Well, I hope this helps someone out there...

Augustine F. Cano
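
For anyone collecting these dumps (Brant's reply says they are logged to
/usr/adm/unix.log), here is a small Python sketch that splits the five
#HDERR lines quoted above into fields and reports which fields change
from one error to the next. The helper names (parse_hderr,
varying_fields, decode_bits) are made up for illustration. The two bit
tables are an ASSUMPTION: they follow the conventional WD1010-style
status/error register layout, and the post does not confirm which
controller or bit meanings apply to ST and EF, so treat the decoded
names as a guess rather than a diagnosis.

```python
import re

# The five dumps quoted in the post, each joined onto one line.
HDERR_LINES = [
    "#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4208 SC:4204 SDH:4224 DMACNT:FFFF DCRREG:94 MCRREG:8900",
    "#HDERR ST:51 EF:10 CL:4280 CH:4202 SN:420E SC:4202 SDH:4222 DMACNT:FFFF DCRREG:92 MCRREG:8B00",
    "#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:420A SC:4202 SDH:4226 DMACNT:FFFF DCRREG:96 MCRREG:8100",
    "#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4200 SC:4204 SDH:4225 DMACNT:FFFF DCRREG:B5 MCRREG:8B00",
    "#HDERR ST:51 EF:10 CL:4283 CH:4203 SN:4204 SC:4202 SDH:4224 DMACNT:FFFF DCRREG:94 MCRREG:8100",
]

# A field is NAME:HEXVALUE; the leading "#HDERR" has no colon and is skipped.
FIELD_RE = re.compile(r"([A-Z]+):([0-9A-F]+)")


def parse_hderr(line):
    """Return {field name: integer value} for one #HDERR line (values are hex)."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(line)}


def varying_fields(records):
    """Names of fields whose value differs between records, in dump order."""
    return [n for n in records[0] if len({r.get(n) for r in records}) > 1]


# ASSUMPTION: WD1010-style bit meanings, not confirmed by the post.
# Bits that are reserved or that I am unsure of are left out.
STATUS_BITS = {7: "busy", 6: "ready", 5: "write fault", 4: "seek complete",
               3: "data request", 1: "command in progress", 0: "error"}
ERROR_BITS = {7: "bad block", 6: "data CRC/ECC error", 4: "ID not found",
              2: "aborted command", 1: "track 0 not found",
              0: "data address mark not found"}


def decode_bits(value, names):
    """List the names of the set bits in value, highest bit first."""
    return [names[b] for b in sorted(names, reverse=True) if value & (1 << b)]


if __name__ == "__main__":
    records = [parse_hderr(line) for line in HDERR_LINES]
    print("fields that vary:", ", ".join(varying_fields(records)))
    print("ST (assumed layout):", decode_bits(records[0]["ST"], STATUS_BITS))
    print("EF (assumed layout):", decode_bits(records[0]["EF"], ERROR_BITS))
```

ST, EF and DMACNT are identical in all five dumps; everything else
varies. Under the assumed layout, EF:10 would read as "ID not found",
but that is only as good as the assumption.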