Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!caen!umich!terminator!pisa.citi.umich.edu!rees
From: rees@pisa.citi.umich.edu (Jim Rees)
Newsgroups: comp.sys.apollo
Subject: Re: DN4500 arbitrarily overloads itself (was Re: (none))
Message-ID: <52629c2f.1bc5b@pisa.citi.umich.edu>
Date: 25 Jun 91 16:50:54 GMT
References: <0677436884@INESCN.RCCN.PT> <1991Jun24.002623.18899@gtephx.UUCP>
Sender: usenet@terminator.cc.umich.edu (usenet news)
Reply-To: rees@citi.umich.edu (Jim Rees)
Organization: University of Michigan IFS Project
Lines: 63

In article <1991Jun24.002623.18899@gtephx.UUCP>, wilsonj@gtephx.UUCP (Jay Wilson) writes:

  I saw this posting and I could not resist having one of my partners in crime (there are 6 of us Sys_admins) respond to it. He has been tracking the Mutex Lock problem for over a year now and this is what he had to say.
  ...
  The error will rear its ugly head with no warning or pattern, and once you get it you MUST reboot and run the long SALVOL to appease its appetite for disaster.

I find this hard to believe. I've never had to run a long salvol after getting a stuck sfcb mutex, and I can't think of anything that a long salvol might do that would fix it.

  If you would like more information on exactly what "sfcb hash table" and "mutex lock" are, please refer to a copy of the "Domain/OS Design Principles" 014962-A00 pages 9-14,9-15.

That paper was also published in the Atlanta Usenix Proceedings, which I think was summer 1986. It's also available by ftp from pisa.citi.umich.edu. I think it's an excellent paper and everyone should read it.

The basic problem with putting mutex locks in shared memory is that any old program can go and trash them, and then you're stuck. What's needed is a true object oriented architecture with tagged storage, like the old Intel 432 or the IBM System 38. But the trend seems to be in the opposite direction, and operating systems seem to be getting more primitive, yet bloated, every year.
Multitasking was common on all computers in the mid 1960s, then pretty much disappeared in the 80s when everyone started running MS-DOS. I'm waiting for the days when we all have to start using batch again. All this is enough to make an old systems guy like me want to retire to a small midwestern town and spend his summers in places like Tanjung Pinang.

Anyway, where was I? Oh yes, mutex locks.

These problems are nearly impossible to track down. Since the sfcbs are central to ios, it's very hard to do anything in the debugger if, for example, you've set a breakpoint in mutex_$lock. Everybody and his mother calls mutex_$lock, so you might have to hit it 10000000 times before catching the one time that has no matching unlock. And then, how do you know when that happens? It's a Turing completion problem. How do you tell the debugger to breakpoint on something that isn't going to happen? "Please stop the next time mutex_$unlock *isn't* called." And if you do manage to catch it, there you are with the sfcbs locked, and you can't do any IO. And remember that the missing unlock can happen in any process.

Even worse is the case where it isn't an unmatched lock, it's some random trashing of memory that happens to scribble on the sfcb. It can happen at any time, in any process.

Last time I wrote a type manager (for AFS), I had a few of these problems. I couldn't debug them. I fixed them by tenacious examination of source code. I suspect that's how Apollo engineers fix them too, the few who are left who even know what an sfcb is.

To Apollo's credit, I have to say that I haven't seen a single stuck mutex since I installed sr10.3. I suspect there were some problems with TCP before this but I wouldn't swear to it -- it may have been my screwy type manager.

Enough ranting.
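
For readers who want the unmatched-lock problem described above in concrete form, here is a minimal sketch in Python. It is not Domain/OS code and does not use mutex_$lock; the class AuditedMutex, the helper report_stuck, and the functions good_path and leaky_path are all hypothetical names invented for illustration. The idea is that instead of breakpointing on every lock call, each lock records who took it and from where, so that a lock found stuck later can at least be attributed to a call site.

```python
import threading
import traceback


class AuditedMutex:
    """Hypothetical wrapper: a mutex that remembers who locked it and where."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.owner = None  # name of the thread currently holding the lock
        self.site = None   # "file:line in function" of the acquiring call

    def lock(self):
        self._lock.acquire()
        # With limit=2 the stack is [caller of lock(), lock() itself].
        caller = traceback.extract_stack(limit=2)[0]
        self.owner = threading.current_thread().name
        self.site = f"{caller.filename}:{caller.lineno} in {caller.name}"

    def unlock(self):
        self.owner = None
        self.site = None
        self._lock.release()


def report_stuck(mutex, timeout=0.1):
    """Return a description of the holder if the mutex cannot be taken
    within `timeout` seconds, or None if it is free."""
    if mutex._lock.acquire(timeout=timeout):
        mutex._lock.release()
        return None
    return f"{mutex.name} held by {mutex.owner} since {mutex.site}"


def good_path(m):
    m.lock()
    m.unlock()


def leaky_path(m):
    m.lock()
    # Bug under study: returns without the matching m.unlock().


sfcb_mutex = AuditedMutex("sfcb")  # the name is illustrative only
good_path(sfcb_mutex)
before = report_stuck(sfcb_mutex)  # lock is free at this point
leaky_path(sfcb_mutex)
after = report_stuck(sfcb_mutex)   # lock is stuck; report names leaky_path
print(before)
print(after)
```

This only helps with the first failure mode in the post, the lock with no matching unlock, and only after the fact. It does nothing for the worse case of some unrelated code scribbling on the lock, since the bookkeeping fields would sit in the same trashable memory as the lock itself.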