Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!caesar.cs.montana.edu!ogicse!emory!stiatl!rsiatl!jgd
From: jgd@rsiatl.UUCP (John G. De Armond)
Newsgroups: comp.unix.i386
Subject: Re: Disks Hang Under 2.0.2 SCSI
Message-ID: <907@rsiatl.UUCP>
Date: 12 Dec 89 19:37:40 GMT
References: <654400003@cdp>
Reply-To: jgd@rsiatl.UUCP (John G. De Armond)
Organization: Radiation Systems, Inc. (a thinktank, motorcycle, car and gun works facility)
Lines: 58

In article <654400003@cdp> steve@cdp.UUCP writes:
>
>
>SUMMARY -- README
>-------
>We have been experiencing regular crashes running under
>Interactive 2.0.2 with 3 SCSI disks on an aha1542a.  Later in
>this message is a script which crashes our machine.  The
>purpose of this message is to find other people who are
>willing to try to replicate these crashes on various
>machines.  I encourage folks to try out the script, even if
>they do not have our exact hardware configuration.  This will
>help us to better understand the whether the problem lies in
>hardware or in 2.0.2.  
>
>DETAILS
>-------
>The symptom of the crashes is that all processes continue to
>run, but any process that goes for the disk hangs.  So, getty
>prints the login prompt, and accepts a name at login:, but
>when it goes to spawn login, the exec hangs the system.
>Switch to a different virtual console, and repeat the same
>thing.  emacs works fine until it tries to auto-save, open
>a file, etc...


Steve, 

We have had the same failure here under similar conditions.  Configuration
here is an Adaptec host adaptor and two 380 MB Newbury Data drives.  

Our problem seemed to manifest itself mostly under pathological conditions,
such as when a bad block is discovered.  I've also seen it when I've been
running a script similar to yours, designed to hammer a new hard disk before
putting it into service.

The external symptoms are as you note PLUS I notice that the activity LED
on the Adaptec board is stuck on AND the activity LED on one of the drives
is on continuously.

We now have a bit more data in that it occurs on two totally different
drive types.

Without any investigation other than external observation, I suspect that the 
problem has to do with either a buffer getting overrun or a problem with
one task releasing the SCSI bus to another.

The fact that the problem only occurs either when 2 drives are heavily loaded
or when an error condition happens - which appears from the LED activity 
to tie the bus up for a spell - should be a major clue.  I absolutely
cannot cause this failure by any combination of loading on one drive.
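
For anyone who wants to try the two-drive case, here is a rough sketch of
the kind of load I mean.  It is not my script or Steve's.  It is written in
Python purely for illustration, and the names hammer and run_load, the file
sizes, and the directories are all assumptions.  Each worker writes a
patterned file, forces it out to disk, reads it back, and counts blocks that
do not match.  Note that the read-back may be satisfied from the buffer cache
rather than the disk, so treat it as a load generator more than a media test.

```python
import os
import tempfile
import threading


def hammer(directory, passes=3, file_size=1 << 20, block_size=8192):
    """Write a patterned file in `directory`, read it back, count bad blocks."""
    # block_size is assumed to be a multiple of 256 so the pattern fills a block.
    pattern = bytes(range(256)) * (block_size // 256)
    blocks = file_size // block_size
    errors = 0
    for n in range(passes):
        path = os.path.join(directory, "hammer.%d.tmp" % n)
        with open(path, "wb") as f:
            for _ in range(blocks):
                f.write(pattern)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data out to the device
        with open(path, "rb") as f:
            for _ in range(blocks):
                if f.read(block_size) != pattern:
                    errors += 1
        os.remove(path)
    return errors


def run_load(directories, **kwargs):
    """Hammer every directory at the same time, one thread per directory."""
    results = {}

    def worker(d):
        results[d] = hammer(d, **kwargs)

    threads = [threading.Thread(target=worker, args=(d,)) for d in directories]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Demo on two scratch directories; to load two real drives, pass one
    # directory from a file system on each drive instead.
    a, b = tempfile.mkdtemp(), tempfile.mkdtemp()
    print(run_load([a, b], passes=1, file_size=1 << 16))
```

Running it with a single directory gives the one-drive case, which here
never fails.  Running it with one directory on each drive gives the
two-drive case.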

John

-- 
John De Armond, WD4OQC                     | The Fano Factor - 
Radiation Systems, Inc.     Atlanta, GA    | Where Theory meets Reality.
emory!rsiatl!jgd          **I am the NRA** |