Path: utzoo!attcan!uunet!husc6!rice!sun-spots-request
From: eap@bu-it.bu.edu (Eric A. Pearce)
Newsgroups: comp.sys.sun
Subject: Re: disk sequencer error
Message-ID: <8901162027.AA12705@bu-it.BU.EDU>
Date: 25 Jan 89 00:51:10 GMT
Sender: usenet@rice.edu
Organization: Sun-Spots
Lines: 78
Approved: Sun-Spots@rice.edu
Original-Date: Mon, 16 Jan 89 15:27:10 EST
X-Sun-Spots-Digest: Volume 7, Issue 118, message 3 of 11

tomc@dftsrv.gsfc.nasa.gov (Tom Corsetti):

 >Recently, our Sun 3/260 crashed because of a power outage....
 >Well, today, almost a
 >week later, I shutdown and rebooted, and got the message:
 >  xy0a: read retry (disk sequencer error) -- blk #495, abs blk #495
 >Is this a serious disk problem that I should worry about?...

dinah@shell.UUCP (Dinah Anderson):

 >...
 >I would like to know what the errors mean and under what circumstances
 >they occur. I would also like to know what we should do about them.

I looked up the error in my Xylogics 451 manual:

"Disk Sequencer Error - The disk sequencer did not finish its operation
within the allowed time.  Several factors may cause this problem. 

  - The 451 did not receive the servo clock signal from the selected
    disk drive.  Check the B cable; if the connection is good, try a
    different B cable port on the 451.

  - The 451 is not receiving any read data from the selected drive. Check
    the B cable.

  - The Multibus may be preventing the 451 from gaining proper access."

The manual entry I quote from above suggests the problem could be with the
cabling or the controller itself, but this has not been the case for us.
A bad controller usually spews out large numbers of errors with random
block numbers over more than one disk.  A bad cable will produce random
block errors on one drive (since it's unlikely that more than one cable
would crap out at a time.)  We had drive cable problems on some
rack-mounted systems (3/180's and 3/280's). I believe they were caused by
repeated flexing of the drive cables by the doors on the back of the
cabinets.  The older rack setups have several feet of cable that dangles
out of the back of the cabinet and move every time you open the door. (The
doors have since been removed - I have not seen any cooling problems so
far).  

A bad disk usually will have errors that give sequential block numbers or
at least repeat them numerous times.  If you only get an occasional disk
error, such as one a week, you might be safe to just map or slip the bad
spots, but in my experience, any errors that occur with regularity are
indicative of future trouble.  
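
The rules of thumb above (controller: random blocks on more than one disk;
cable: random blocks on one drive; disk: sequential or repeated blocks) can
be sketched as a small log triage script.  This is only an illustration in
Python, not anything from Sun or Xylogics.  The regular expression is
modelled on the single xy0a message quoted at the top of this thread, and
the function names, the "near" distance of 8 blocks, and the "half of the
gaps" cutoff are all arbitrary assumptions of mine.

```python
import re
from collections import defaultdict

# Assumption: every message looks like the one example quoted in this
# thread.  Other driver messages may be formatted differently.
LINE_RE = re.compile(
    r"(?P<drive>xy\d+)(?P<part>[a-h]):\s*(?P<op>\w+) retry \((?P<err>[^)]*)\)"
    r" -- blk #(?P<blk>\d+), abs blk #(?P<abs>\d+)"
)


def parse(lines):
    """Return (drive, absolute block) pairs for every line that matches."""
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            hits.append((m.group("drive"), int(m.group("abs"))))
    return hits


def triage(lines, near=8):
    """Guess the likely culprit from the pattern of logged block numbers."""
    hits = parse(lines)
    if not hits:
        return "no errors"

    by_drive = defaultdict(list)
    for drive, blk in hits:
        by_drive[drive].append(blk)

    # Errors spread over more than one disk point at the controller.
    if len(by_drive) > 1:
        return "suspect controller"

    blocks = sorted(next(iter(by_drive.values())))
    if len(blocks) < 2:
        return "too few errors to judge"

    # Repeated blocks give a gap of 0; sequential blocks give small gaps.
    gaps = [b - a for a, b in zip(blocks, blocks[1:])]
    close = sum(1 for g in gaps if g <= near)

    # Hypothetical cutoff: at least half of the gaps are small.
    if close * 2 >= len(gaps):
        return "suspect disk"
    return "suspect cable"
```

Feeding it the console messages as a list of strings gives one of five
short verdicts.  Treat the answer as a hint about where to look first and
nothing more; a handful of errors is not much to go on.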

If you have a Sun hardware contract, I would have them replace it as soon
as possible.  If they balk at replacing a drive with only a few errors,
push them a bit.  It *is* possible for systems to run for long periods
without disk problems.           

I would do a full level 0 dump of the disk as soon as possible.  If you act
before a crisis, you can have a scheduled downtime for a drive
replacement.  You would do a level 0 dump and Sun would come in and
replace it.  This would make the restore much easier, as you would not
have to worry about multi-level backups, not to mention the time you would
save.

I have seen this error on Fujitsu 2351's ("single" Eagle) and 2361's
("double" or "super" Eagle).  It was always accompanied by a massive
number of disk errors.  

Our local Sun field service will replace single Eagles as a whole but they
replace only parts of double Eagles (in this case the HDA and the servo
board).  

The "Eagle" series of drives seems to be rather sensitive to power
fluctuations.  The newer Hitachi DK815-10 and NEC D2363 seem to be more
tolerant.

 -e

 Eric Pearce                                   ARPANET eap@bu-it.bu.edu
 Boston University Information Technology      CSNET   eap%bu-it@bu-cs
 111 Cummington Street                         JNET    jnet%"ep@buenga" 
 Boston MA 02215                               UUCP    !harvard!bu-cs!bu-it!eap 
 617-353-2780 voice  617-353-6260 fax          BITNET  ep@buenga