Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!swrinde!elroy.jpl.nasa.gov!ncar!asuvax!hrc!gtephx!wilsonj
From: wilsonj@gtephx.UUCP (Jay Wilson)
Newsgroups: comp.sys.apollo
Subject: Re: DN4500 arbitrarily overloads itself (was Re: (none))
Summary: Mutex Virus!!!!
Message-ID: <1991Jun24.002623.18899@gtephx.UUCP>
Date: 24 Jun 91 00:26:23 GMT
References: <0677436884@INESCN.RCCN.PT>
Organization: gte
Lines: 144

In article , dpassage@soda.berkeley.edu (David G. Paschich) writes:
> In article <0677436884@INESCN.RCCN.PT>,
> JCF%INESCN.RCCN.PT@CUNYVM.CUNY.EDU (Joao Canas Ferreira) writes:
> ......
> During one of the last fits, the user tried to logout. After some
> time, he got the (almost) usual question (Blast ? (Y/N)). After answering
> Yes and waiting some time, the message 'Unable to obtain scfb hash table mutex
> lock from (stream manager/ scfb)' ?
> ......

I saw this posting and I could not resist having one of my partners
in crime (there are 6 of us Sys_admins) respond to it. He has been
tracking the Mutex Lock problem for over a year now and this is what
he had to say.

(FLAME ON)

Dear Mr. Ferreira,

Your message (0677436884@INESCN.RCCN.PT) concerning the "mutex
lock/sfcb hash table" error struck a nerve right into the core of my
spine. We have around 530 workstations at our site and we have been
attempting to combat this virus for many months now. (I like to call
it a "virus" as there is no way to control it and no one at Apollo can
tell us what REALLY causes it or how to stop it. Having sys_admins
from various sites throw the word "virus" around when referring to
something in the Apollo operating system should strike terror into
Apollo/HP sales staff, and maybe someone with pull will prime the
Apollo R&D engineering pump and get a resolution.)

The error will rear its ugly head with no warning or pattern, and
once you get it you MUST reboot and run the long SALVOL to appease its
appetite for disaster.

Please note a few items:

1). When you run the long SALVOL be sure to parse the options out
    as follows:

        1 -f -a -s -t

    We determined that, with the changes at 10.2, SALVOL no longer
    allows you to string options together as "1 -fast".

2). The long SALVOL will clear up the problem for varying periods
    of time, but there are no guarantees. It does seem to do more
    good than harm.

3). Once a node gets "infected" with this error it seems to get it
    again and again. "Uninfected" nodes seem to be o.k. until ...

4). INVOL and reloading software did not help the nodes that had
    the errors frequently, but for some strange reason replacing the
    CPU made the errors less frequent. (We had one machine getting
    this error daily - replaced the CPU - now it only gets it weekly.)

5). There are a few patches from Apollo to correct this (139 and
    196, for example), but none of them have put a dent in our
    problems.

6). The problem does not seem to be machine type (3000/4000/3500)
    specific, and we get it on all types. We even had Apollo check
    CPU rev levels on the machines. The virus is much more active at
    SR10.2 though !!!

We thought that maybe our users were using some tool that was causing
the problems, so I asked a user who saw the hang daily on his node to
work on another node for a while. He only saw the hang once in two
weeks then. Also, I could log into his workstation and BOOM -
"sfcb/mutex".

We also noticed the case of a user who got a problem on their node
going to another node and "infecting" it. (That user has now been
labeled, and must shout "unclean, unclean !" when coming in contact
with other users. A punishment I would not wish on anyone.)

Another oddity is that most of our people use the same tools, have the
same load of o/s, and work on the same data, yet no one can explain
(SURPRISE!) why some nodes hang and other nodes never see the virus.
(I think it's the will of Zeus as a punishment for Apollo starting his
own business outside of the Mount Olympus tax jurisdiction.)

If you would like more information on exactly what "sfcb hash table"
and "mutex lock" are, please refer to a copy of the "Domain/OS Design
Principles" 014962-A00, pages 9-14, 9-15. I found this to be a better
explanation than the one I got from the response center: "It's a table
that controls everything."

Apollo claims the problem has been fixed at SR10.3 (I won't even get
started on the extreme, orgasmic joy experienced when hearing this
phrase), but that is yet to be determined. We have SR10.3 onsite, but
it will be a while before we can get all 530 machines up on it. In
Apollo's defense I can say that none of the 6 machines we have running
SR10.3 have seen this error ... maybe it's because we don't use them
yet ?

I do have an open APR/SR and an open call A2047527, but it's probably
been closed because I was out for an afternoon and missed the, "call
me by the end of the day or I'm closing it" call.

My latest efforts to get Apollo going on this problem seem to be
working better, and after a few calls with some of the upper Apollo
support personnel I feel they are actually looking into the virus.

I will keep you posted, as there are probably many items I forgot on
this; the portion of my brain that deals with the mutex lock seems to
get fuzzier each day as I burst various blood vessels in dismay. But
please - on behalf of myself, your family, and your Apollo sys_admin
brethren everywhere ... don't hold your breath.

--
Matt Ferris
Systems Programmer
AG Communication Systems
2500 West Utopia Road
Phoenix, AZ 85027
Phone 1-(602)-582-7634
Fax   1-(602)-581-4967

(FLAME OFF)

I am the only one in the group that monitors what is going on on the
net, which is why Matt fed his reply back via me.
If you have any replies for him, please send them to him directly at:

UUCP    : {ncar!noao!asuvax | uunet!zardoz!hrc | att}!gtephx!ferrism
INTERNET: gtephx!ferrism@asuvax.eas.asu.edu

Thanks

--
Jay Wilson (wilsonj@gtephx)   SR Systems Programmer
UUCP    : {ncar!noao!asuvax | uunet!zardoz!hrc | att}!gtephx!wilsonj
INTERNET: gtephx!wilsonj@asuvax.eas.asu.edu
AG Communication Systems, Phoenix, AZ
voice (602) 581-4496   fax (602) 581-4967

"A river that overflows its banks is never a problem until a road is
built across it."