Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!rochester!rutgers!husc6!cmcl2!brl-adm!adm!rbj@icst-cmr.arpa
From: rbj@icst-cmr.arpa (Root Boy Jim)
Newsgroups: comp.unix.wizards
Subject: SI 9900 hangs
Message-ID: <8585@brl-adm.ARPA>
Date: Fri, 31-Jul-87 11:47:25 EDT
Article-I.D.: brl-adm.8585
Posted: Fri Jul 31 11:47:25 1987
Date-Received: Sun, 2-Aug-87 00:42:35 EDT
Sender: news@brl-adm.ARPA
Lines: 66

    eichelbe@nadc.arpa (J. Eichelberger) writes:
    > Every so often we get a system hang. All activity on the SI controller
    > stops. Hitting the reset on the toggle switch restarts everything. Most
    > of the time we don't see any error messages. If we do see one, it's
    > hp0: not ready

    We get these too. The last time it happened, I checked the
    registers, and all 5 drives had just completed a seek (a sign
    that the massbus datapath is marked "busy") and they all had
    the UNS (unsafe) bit set.

We get these also, but only have two disks: a CDC 9766 (RM05) and an
Eagle. I haven't checked the error bits, but flipping the switch works
for us too. This isn't usually a problem unless I am at home.

    cithex.caltech.edu has the same problem. They run VMS.

Too bad.

    The SI local office says this is a common problem, typically
    caused by marginal power supply output. I've not had the chance
    to verify this (there isn't enough downtime on this machine for
    my purposes).

Well, the FE's tweaked our voltage, and it seemed to help a bit, but we
still get the error. Perhaps too much current is being drawn and the
voltage drops anyway.

    > We are using the standard 4.3 BSD hp.c for the driver.

    I sure hope that you're using error-free packs. The error
    position determination algorithm in that driver depends on
    separate counters for the number of bytes DMA'd and the number
    of bytes read/written; but the SI 9900 uses the first counter
    for both, so on an error, the driver thinks that more sectors
    were written than actually were.

I hope so too. Either that, or lay out your partition table to avoid any
cylinders (or tracks) with bad sectors on them. Another solution I might
propose is to hack the driver so that it compares the desired sector
with the bad block table *before* it takes the BSE hit instead of after.
Yes, I know, it does take time, and since most transfers are multi
sector, you'd have to break up a `block' that contained a bad sector
into as many as three transfers. Of course, you could also mark the
adjacent sectors bad, in groups of eight (assuming 4k blocks), but I
think the bad144 scheme would map the sectors backwards within the
block. Oy!

    (Before you start flaming about SI: that problem is easy to
    work around compared to some of the bugs I've seen in the
    Emulex SC7000).

Mangler: please mail me the scoop on Emulex. I also work on a system
that has them.

    Seismo has a version of hp.c that works around this by looking
    at the track/sector register (HPDA) instead. It can still lose
    data silently, but if retries are a rare event, it can be lived
    with.

For $300 (I think), SI will sell you their version. The 4.2 one seemed
a bit better than the 4.3 one tho. It also has hacks for online
formatting, header verification, and you don't have to reboot when you
add bad blocks.

    Don Speck   speck@vlsi.caltech.edu  {ll-xn,rutgers,amdahl}!cit-vax!speck

	(Root Boy) Jim Cottrell
	National Bureau of Standards
	Flamer's Hotline: (301) 975-5688
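
For what it's worth, the pre-check idea above (compare the request
against the bad block table before starting the transfer, then break it
up around the bad sector) might be sketched roughly as follows. This is
only an illustration, not code from hp.c or from SI's driver: struct
extent, split_around_bad, and the flat sorted array of bad sector
numbers are all made-up names and assumptions, and looking up the
replacement sector (bad144 style) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* One piece of a split transfer. All names here are hypothetical. */
struct extent {
    long start;   /* first logical sector of this piece */
    long count;   /* number of sectors in this piece */
    int  is_bad;  /* nonzero: a single bad sector, to be redirected */
};

/*
 * Split the request [start, start + count) around every sector listed
 * in bad[] (nbad entries, assumed sorted ascending).  Good runs become
 * ordinary extents; each bad sector becomes its own one-sector extent.
 * Returns the number of extents stored in out[], or -1 if out[] is
 * too small to hold them.
 */
int split_around_bad(long start, long count,
                     const long *bad, size_t nbad,
                     struct extent *out, int max_out)
{
    long cur = start;
    long end = start + count;
    int n = 0;
    size_t i;

    assert(count >= 0 && max_out >= 0);

    for (i = 0; i < nbad && cur < end; i++) {
        if (bad[i] < cur)
            continue;            /* before the part still to be covered */
        if (bad[i] >= end)
            break;               /* sorted table: nothing later overlaps */
        if (bad[i] > cur) {      /* good run ahead of the bad sector */
            if (n >= max_out)
                return -1;
            out[n].start = cur;
            out[n].count = bad[i] - cur;
            out[n].is_bad = 0;
            n++;
        }
        if (n >= max_out)
            return -1;
        out[n].start = bad[i];   /* the bad sector, on its own */
        out[n].count = 1;
        out[n].is_bad = 1;
        n++;
        cur = bad[i] + 1;
    }
    if (cur < end) {             /* good run after the last bad sector */
        if (n >= max_out)
            return -1;
        out[n].start = cur;
        out[n].count = end - cur;
        out[n].is_bad = 0;
        n++;
    }
    return n;
}
```

With 512-byte sectors, an 8-sector (4k) request starting at sector 96
with sector 100 in the table comes back as three pieces: 96-99, 100
alone (flagged for redirection), and 101-103. A request with no bad
sectors in it comes back as a single piece, so the common case still
costs one transfer plus the table scan.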