Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!caen!umich!terminator!pisa.citi.umich.edu!rees
From: rees@pisa.citi.umich.edu (Jim Rees)
Newsgroups: comp.sys.apollo
Subject: Re: DN4500 arbitrarily overloads itself (was Re: (none))
Message-ID: <52629c2f.1bc5b@pisa.citi.umich.edu>
Date: 25 Jun 91 16:50:54 GMT
References: <0677436884@INESCN.RCCN.PT> <DPASSAGE.91Jun20122344@soda.berkeley.edu> <1991Jun24.002623.18899@gtephx.UUCP>
Sender: usenet@terminator.cc.umich.edu (usenet news)
Reply-To: rees@citi.umich.edu (Jim Rees)
Organization: University of Michigan IFS Project
Lines: 63

In article <1991Jun24.002623.18899@gtephx.UUCP>, wilsonj@gtephx.UUCP (Jay Wilson) writes:

  I saw this posting and I could not resist having one of my partners
  in crime (there are 6 of us Sys_admins) respond to it.  He has been tracking
  the Mutex Lock problem for over a year now and this is what he had to say.

  ...

  The error will rear its ugly head with no warning or pattern,
  and once you get it you MUST reboot and run the long SALVOL
  to appease its appetite for disaster.

I find this hard to believe.  I've never had to run a long salvol after
getting a stuck sfcb mutex, and I can't think of anything that a long salvol
might do that would fix it.

  If you would like more information on exactly what "sfcb hash table" and
  "mutex lock" are, please refer to a copy of the "Domain/OS Design Principles"
  014962-A00 pages 9-14,9-15.

That paper was also published in the Atlanta Usenix Proceedings, which I
think was summer 1986.  It's also available by ftp from pisa.citi.umich.edu.
I think it's an excellent paper and everyone should read it.

The basic problem with putting mutex locks in shared memory is that any old
program can go and trash them, and then you're stuck.  What's needed is a
true object-oriented architecture with tagged storage, like the old Intel
432 or the IBM System/38.  But the trend seems to be in the opposite
direction, and operating systems seem to be getting more primitive, yet
bloated, every year.  Multitasking was common on all computers in the
mid-1960s, then pretty much disappeared in the 80s when everyone started
running MS-DOS.  I'm waiting for the day when we all have to start using
batch again.  All this is enough to make an old systems guy like me want to
retire to a small midwestern town and spend his summers in places like
Tanjung Pinang.

Anyway, where was I?  Oh yes, mutex locks.  These problems are nearly
impossible to track down.  Since the sfcbs are central to ios, it's very
hard to do anything in the debugger if, for example, you've set a breakpoint
in mutex_$lock.  Everybody and his mother calls mutex_$lock, so you might
have to hit it 10,000,000 times before catching the one time that has no
matching unlock.  And then, how do you know when that happens?  It's
essentially the halting problem.  How do you tell the debugger to breakpoint on
something that isn't going to happen?  "Please stop the next time
mutex_$unlock *isn't* called."  And if you do manage to catch it, there you
are with the sfcbs locked, and you can't do any I/O.  And remember that the
missing unlock can happen in any process.
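
A debugger can't stop on an unlock that never happens, but a wrapper can
at least record who took a lock and flag one that has been held
suspiciously long.  What follows is a minimal sketch of that idea, not
anything from Domain/OS: TrackedLock, report_stuck and the threshold are
hypothetical names invented here, and the real mutex_$lock interface is
not reproduced.  It assumes every caller goes through the wrapper.

```python
import threading
import time
import traceback


class TrackedLock:
    """A lock that remembers who took it, when, and from where."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.holder = None       # thread name of the current holder, if any
        self.acquired_at = None  # time.monotonic() value at acquisition
        self.where = None        # call stack captured at acquisition

    def lock(self):
        self._lock.acquire()
        self.holder = threading.current_thread().name
        self.acquired_at = time.monotonic()
        # Keep the callers' frames, drop the frame for lock() itself.
        self.where = "".join(traceback.format_stack(limit=4)[:-1])

    def unlock(self):
        self.holder = self.acquired_at = self.where = None
        self._lock.release()

    def stuck_for(self, now=None):
        """Seconds the lock has been held, or 0.0 if it is free."""
        started = self.acquired_at
        if started is None:
            return 0.0
        return (time.monotonic() if now is None else now) - started


def report_stuck(locks, threshold, now=None):
    """List (name, holder, where) for each lock held longer than threshold."""
    return [(l.name, l.holder, l.where)
            for l in locks if l.stuck_for(now) > threshold]
```

A watchdog thread could call report_stuck every few seconds.  It is only
a heuristic: a long hold is not proof of a missing unlock, and the
bookkeeping fields are read without synchronization, so a report can race
with a legitimate unlock.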

Even worse is the case where it isn't an unmatched lock, it's some random
trashing of memory that happens to scribble on the sfcb.  It can happen at
any time, in any process.
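
One common defense against this kind of scribbling is to bracket the lock
record with guard words and check them on every access.  The sketch below
simulates a shared region with a bytearray.  The record layout, the GUARD
value and the function names are assumptions made up for illustration and
have nothing to do with the real sfcb format.

```python
import struct

GUARD = 0x5FCB5FCB   # arbitrary magic value chosen for this sketch
LAYOUT = "<IiI"      # guard word, owner (-1 means free), guard word
SIZE = struct.calcsize(LAYOUT)


def init_lock(mem, offset):
    """Write a free lock record into the shared region."""
    struct.pack_into(LAYOUT, mem, offset, GUARD, -1, GUARD)


def check_lock(mem, offset):
    """Return the owner field, or raise if a guard word was overwritten."""
    head, owner, tail = struct.unpack_from(LAYOUT, mem, offset)
    if head != GUARD or tail != GUARD:
        raise RuntimeError(
            "lock record at offset %d trashed: guards %#x / %#x"
            % (offset, head, tail))
    return owner
```

This tells you that the record was trashed, not who did it, and a stray
write that lands only on the owner field goes unnoticed.  It narrows the
window in which the damage happened; it does not replace reading the
source.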

Last time I wrote a type manager (for AFS), I had a few of these problems.
I couldn't debug them.  I fixed them by tenacious examination of source
code.  I suspect that's how Apollo engineers fix them too, the few who are
left who even know what an sfcb is.

To Apollo's credit, I have to say that I haven't seen a single stuck mutex
since I installed sr10.3.  I suspect there were some problems with TCP
before this, but I wouldn't swear to it -- it may have been my screwy type
manager.

Enough ranting.