Xref: utzoo comp.sys.att:6046 comp.unix.questions:12690 comp.unix.wizards:15402
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!rochester!rutgers!att!homxb!homxc!mrb1
From: mrb1@homxc.ATT.COM (M.BAKER)
Newsgroups: comp.sys.att,comp.unix.questions,comp.unix.wizards
Subject: Sys V/386/3.2 UNIX system getting hung (?)
Keywords: 6386, UNIX, kernel, confused
Message-ID: <6226@homxc.ATT.COM>
Date: 6 Apr 89 15:43:27 GMT
Organization: AT&T BL Holmdel NJ USA
Lines: 71

Hi ---

Since the net was so helpful on my last query, I'd like to give it
another try:

We have an AT&T 6386E system running UNIX SysV/3.2. While running our
application, it has been observed to 'hang'. Specifically, the
application stops in the middle of things. More importantly, all the
terminal I/O stops.......including the system console. You can't log
in on a free getty. Anything you type gets echoed back to the screen,
but nothing gets done with it. If you hit "Ctrl-Alt-Del", the screen
displays a message saying "You must run shutdown before using
Ctrl-Alt-Del" or something very similar to that. There is no "Fatal
Error - Parity Check at ...." message or anything abnormal on the
console. The only thing to do then (that seems to work for me) is to
hit RESET.

Well, rebooting kind of destroys all the clues. Since the kernel
apparently never did a panic(), there's no dump available to look at
with crash. If the hang occurred in the middle of the night, and time
elapses before you reset the system, sar shows nothing past the last
recorded 'checkpoint' before the system 'died'.

I will furnish more details of our hardware configuration/software
application upon request....for now, I think that these basic clues
should be able to get us aimed in the right direction.

My first suspicion: The 3.1 & 3.2 software notes state that if you
"run out of free clists, all input/output activity from/to terminal
ports and the console will cease.
No warning message is printed by the system to show that it is out of
clists". Sounded good at first, so we raised the NCLIST tunable
parameter from 120 to 170 (recommended value for 4M machine) and then
to 200 (the max. in mtune). Still had the problem, though. Which leads
to a couple of quick questions:

1.) Can you check the number of free clists while the system is
    running? sar doesn't seem to be any help here, and I'm sure crash
    can reveal it, but I'm not sure how to get to it.

2.) Is there any circumstance in which clists can get slowly used up
    (i.e., occasionally not returned to the free pool)?

Also, could this problem be symptomatic of the time slicer interrupt
going away (not being generated, or recognized), which robs UNIX of
knowing that time is passing us by? Or are we just in some kind of
major deadlock? I think that the processor is still alive, since
console characters echo to the screen and it responds to the
Ctrl-Alt-Del keyin. Plus this is a protected mode machine, so it's a
little tougher for an application to clobber the OS by writing in the
wrong area, or whatever.

Any clues/suggestions/tips/criticisms/flames/whatever would be really
appreciated.

Thanks

M. Baker
homxc!mrb1
201-949-3455