Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!lll-lcc!lll-winken!uunet!mcvax!hp4nl!phigate!prle!prles2!cst!meulenbr From: meulenbr@cst.prl.philips.nl (Frans Meulenbroeks) Newsgroups: comp.os.minix Subject: bug in minix fs/devicce.c (causing deadlock) Message-ID: <443@prles2.UUCP> Date: 23 Apr 89 20:44:36 GMT Sender: nobody@prles2.UUCP Reply-To: meulenbr@cst.prl.philips.nl (Frans Meulenbroeks) Organization: Centre for Software Technology, Philips Eindhoven Lines: 82 Hi, I've discovered the following problem in minix/st. I'm almost sure that it applies to minix/pc as well. symptoms: system hangs. debugging function keys work. caused by: a process which does frequent alarm calls. Also other interrupts can cause this behaviour. analysis: Whenever an interrupt occurs during an I/O action this action is cancelled and EINTR is returned on the system call initiating that action. This is done in the following way: from withing fs/pipe.c the routine rw_dev from fs/device.c is called to send a CANCEL message to the task servicing the I/O action. Within rw_dev sendrec is called to send the message to the task in question. Sendrec can return E_LOCKED if there would occur deadlock by sending the message. If so rw_dev reads the message waiting and processes it (it can only be a revive message). Then a new try for sendrec is done. Hovever, if the message waiting is a completion message for the I/O action to be cancelled the next call of sendrec is a cancel message to a task which has no action in progress. The task does not reply and thus the fs will be waiting ad infinitum. fix: The problem can be fixed in several ways: - always give a reply back on the CANCEL message. however withing the pc tty driver a warning for races is given in the do_cancel routine. - check in the routine rw_dev for the condition described above and deal with it. I've chosen for the second approach. I include the routine rw_dev from fs/device.c below. No cdiff since my version may as well vary on other places. Good luck, Frans Meulenbroeks (.signature at end) PS: this was a tough one. /*===========================================================================* * rw_dev * *===========================================================================*/ PUBLIC void rw_dev(task_nr, mess_ptr) int task_nr; /* which task to call */ message *mess_ptr; /* pointer to message for task */ { /* All file system I/O ultimately comes down to I/O on major/minor device * pairs. These lead to calls on the following routines via the dmap table. */ int r; message m; while ((r = sendrec(task_nr, mess_ptr)) == E_LOCKED) { /* sendrec() failed to avoid deadlock. The task 'task_nr' is * trying to send a REVIVE message for an earlier request. * Handle it and go try again. */ if (receive(task_nr, &m) != OK) panic("rw_dev: can't receive", NO_NUM); /* if we're trying to send a cancel message to a task which just tries * to send a completion reply, ignore the reply and abort the cancel * request. The caller will do the revive for the process. * 22-4-89 F. Meulenbroeks */ if ((m.REP_PROC_NR == mess_ptr->PROC_NR) && (mess_ptr->m_type == CANCEL)) return; revive(m.REP_PROC_NR, m.REP_STATUS); } if (r != OK) panic("rw_dev: can't send", NO_NUM); } Frans Meulenbroeks (meulenbr@cst.prl.philips.nl) Centre for Software Technology ( or try: ...!mcvax!philmds!prle!cst!meulenbr)