Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!rochester!rutgers!husc6!cmcl2!brl-adm!adm!rbj@icst-cmr.arpa
From: rbj@icst-cmr.arpa (Root Boy Jim)
Newsgroups: comp.unix.wizards
Subject: SI 9900 hangs
Message-ID: <8585@brl-adm.ARPA>
Date: Fri, 31-Jul-87 11:47:25 EDT
Article-I.D.: brl-adm.8585
Posted: Fri Jul 31 11:47:25 1987
Date-Received: Sun, 2-Aug-87 00:42:35 EDT
Sender: news@brl-adm.ARPA
Lines: 66


   eichelbe@nadc.arpa (J. Eichelberger) writes:
   > Every so often we get a system hang. All activity on the SI controller
   > stops.  Hitting the reset on the toggle switch restarts everything.  Most
   > of the time we don't see any error messages. If we do see one, it's
   > hp0: not ready

   We get these too.  The last time it happened, I checked the registers,
   and all 5 drives had just completed a seek (a sign that the massbus
   datapath is marked "busy") and they all had the UNS (unsafe) bit set.

We get these also, but only have two disks: a CDC 9766 (RM05) and an Eagle.
I haven't checked the error bits, but flipping the switch works for us too.
This isn't usually a problem unless I am at home.

   cithex.caltech.edu has the same problem.  They run VMS.

Too bad.

   The SI local office says this is a common problem, typically caused
   by marginal power supply output.  I've not had the chance to verify
   this (there isn't enough downtime on this machine for my purposes).

Well, the FEs tweaked our voltage, and it seemed to help a bit, but
we still get the error. Perhaps too much current is being drawn and the
voltage drops anyway.

   >   We are using the standard 4.3 BSD hp.c for the driver.

   I sure hope that you're using error-free packs.  The error position
   determination algorithm in that driver depends on separate counters
   for the number of bytes DMA'd and the number of bytes read/written;
   but the SI 9900 uses the first counter for both, so on an error, the
   driver thinks that more sectors were written than actually were.

I hope so too. Either that, or lay out your partition table to avoid
any cylinders (or tracks) with bad sectors on them.

Another solution I might propose is to hack the driver so that it
compares the desired sector with the bad block table *before* it
takes the BSE hit instead of after. Yes, I know, it does take time, and
since most transfers are multisector, you'd have to break up a `block'
that contained a bad sector into as many as three transfers. Of course,
you could also mark the adjacent sectors bad, in groups of eight (assuming
4k blocks), but I think the bad144 scheme would map the sectors backwards
within the block. Oy!
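
To make the "as many as three transfers" idea concrete, here is a rough
sketch of the splitting step. It is not taken from hp.c: the struct, the
function name, and the assumption that the bad block table is available
as a sorted array of absolute sector numbers are all made up for
illustration.

```c
#include <stddef.h>

/* Hypothetical description of one piece of a split request. */
struct xfer {
    long start;   /* first sector of this piece */
    long count;   /* number of sectors in this piece */
    int  is_bad;  /* nonzero: a single bad sector, to be redirected
                     to its replacement by the caller */
};

/*
 * Split the request [start, start+count) around the sectors listed in
 * bad[] (assumed sorted ascending, nbad entries).  Fills out[] with at
 * most max pieces and returns how many were written, or -1 if out[]
 * is too small.
 */
int split_around_bad(long start, long count,
                     const long *bad, size_t nbad,
                     struct xfer *out, int max)
{
    long cur = start;
    long end = start + count;
    int n = 0;
    size_t i;

    for (i = 0; i < nbad && cur < end; i++) {
        if (bad[i] < cur)
            continue;               /* before the request, or a duplicate */
        if (bad[i] >= end)
            break;                  /* past the request; table is sorted */
        if (bad[i] > cur) {         /* good run leading up to the bad sector */
            if (n >= max)
                return -1;
            out[n].start = cur;
            out[n].count = bad[i] - cur;
            out[n].is_bad = 0;
            n++;
        }
        if (n >= max)
            return -1;
        out[n].start = bad[i];      /* the bad sector itself, on its own */
        out[n].count = 1;
        out[n].is_bad = 1;
        n++;
        cur = bad[i] + 1;
    }
    if (cur < end) {                /* good run after the last bad sector */
        if (n >= max)
            return -1;
        out[n].start = cur;
        out[n].count = end - cur;
        out[n].is_bad = 0;
        n++;
    }
    return n;
}
```

With one bad sector in the middle of an eight sector block you get three
pieces: the good run before it, the bad sector by itself, and the good
run after it. A real driver would still have to queue those pieces and
do the replacement-sector lookup, which this sketch leaves out.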

   (Before you start flaming about SI:  that problem is easy to work
   around compared to some of the bugs I've seen in the Emulex SC7000).

Mangler: please mail me the scoop on Emulex. I also work on a system
that has them.

   Seismo has a version of hp.c that works around this by looking at
   the track/sector register (HPDA) instead.  It can still lose data
   silently, but if retries are a rare event, it can be lived with.
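
For anyone wondering what "looking at the track/sector register" could
amount to, here is a guess at the arithmetic, not the Seismo code. It
assumes the track sits in the high byte and the sector in the low byte
of the register value, and that the cylinder is read separately. Whether
the hardware leaves the register at the failing sector or one past it is
not stated above, so treat the result as approximate. All names are
hypothetical.

```c
/* Hypothetical drive geometry; field names are illustrative only. */
struct geom {
    int nsect;  /* sectors per track */
    int ntrak;  /* tracks per cylinder */
};

/*
 * Absolute sector number for a cylinder plus a track/sector word,
 * assuming track in the high byte and sector in the low byte.
 */
long da_to_sector(const struct geom *g, int cyl, unsigned da)
{
    int tn = (da >> 8) & 0xff;
    int sn = da & 0xff;

    return ((long)cyl * g->ntrak + tn) * g->nsect + sn;
}

/*
 * Sectors moved before the controller stopped: the distance from the
 * starting sector to where the registers say the drive ended up.
 * Returns 0 if the registers point at or before the start.
 */
long sectors_done(const struct geom *g, long start, int cyl, unsigned da)
{
    long stop = da_to_sector(g, cyl, da);

    return stop > start ? stop - start : 0;
}
```

The point is that this number comes from the disk address rather than
from a byte counter, so it does not depend on the counter that the
SI 9900 shares between DMA and read/write.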

For $300 (I think), SI will sell you their version. The 4.2 one seemed
a bit better than the 4.3 one, though. It also has hacks for online
formatting and header verification, and you don't have to reboot when
you add bad blocks.

   Don Speck   speck@vlsi.caltech.edu  {ll-xn,rutgers,amdahl}!cit-vax!speck

	(Root Boy) Jim Cottrell	<rbj@icst-cmr.arpa>
	National Bureau of Standards
	Flamer's Hotline: (301) 975-5688