Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!cs.utexas.edu!uunet!nuchat!lobster!siswat!buck
From: buck@siswat.UUCP (A. Lester Buck)
Newsgroups: comp.unix.internals
Subject: Re: UNIX semantics do permit full support for asynchronous I/O
Message-ID: <555@siswat.UUCP>
Date: 2 Sep 90 07:48:48 GMT
References: <60345@lanl.gov> <27619@nuchat.UUCP> <1990Sep1.185221.8718@eng.umd.edu>
Organization: Photon Graphics, Houston
Lines: 68

In article <1990Aug30.222226.20866@cbnewsm.att.com> lfd@cbnewsm.att.com (leland.f.derbenwick) writes:
>In essentially any serious database application, a completed
>write() to a raw disk is treated as a guarantee that the data
>block has been _physically written to the device_.  (This is
>needed to ensure reliable transaction behavior in the presence
>of potential system crashes.)  Since your suggestion would void
>that guarantee, it is not benign.

Close, but not quite.  The guarantee is that the _controller_ has
accepted the data.  If/when the bits actually hit the media is not
fully under the control of the OS.  Remember SCSI has a READ BUFFERED
DATA command for error recovery.  SCSI disks are coming with bigger
caches all the time, and a power hit can take out a significant
amount of data.  If the database really must remain consistent, a UPS
is probably required.

As to Steve's idea, it has a certain elegance to recommend it.  But
its practical value is low.  Sure, it can be made to have full Unix
semantics, but at the price of the common case reducing almost
exactly to synchronous I/O.  Or imagine the case of an I/O server
process sharing memory with dozens of clients.  Each shared memory
segment will have to keep a list of every process that must block on
a page fault.  The practical effect will be that an _arbitrary_
number of processes will potentially block for every I/O, instead of
doing useful work in their own address spaces.
This scheme falls into the general class of YANSUAIOM (Yet Another
Non-Standard Unix Asynchronous I/O Mechanism), as do the schemes with
ioctl's or select'ing on disk.

What may be difficult to understand at this point, when Unix has not
had a standard asynchronous I/O facility, is that we will program
_differently_ when it is widely available.  The semantics of I/O must
change (broaden).  The structure and flow of a program will be
significantly different when it uses asynchronous I/O, in the same
way that the availability of real threads leads to new programming
paradigms to take advantage of those facilities.  We may have to look
at schemes used in the realtime Unix versions, VMS (gag) or even MVS
(gag!!), which have had asynchronous I/O facilities for up to
decades, to adapt to this new mindset.

The only reason one designs an asynchronous I/O facility is to
efficiently overlap computation with I/O transfers, and that can take
some careful thought to achieve maximum speedup.  For example, Chris
Torek recently traced the path of a raw synchronous I/O, which
eventually sleeps in physio() in the context of the calling process.
A large transfer will loop through physio, with a wakeup/sleep cycle
for every chunk (limited by how much physical memory the OS wants to
lock down at once).  Each sleep/wakeup cycle is an expensive context
switch, involving reloading the virtual memory state of the caller.
But a fully asynchronous I/O scheme drags along enough state to start
the next I/O chunk all within the driver interrupt routine, with the
calling process completely out of context.  Of course, it is a bit(!)
more complicated if non-resident pages are found in the next chunk
that needs to be page-locked...

The POSIX.4 asynchronous I/O facilities are moving toward final
ballot and present a rich set of asynchronous I/O primitives.
These include the obvious aread/awrite, and listio, similar to
readv/writev for synchronous transfers, which can fire off a large
number of aio's at once and optionally be notified only when they are
all complete.  Iosuspend is a more advanced version of select that
waits for completion of any operations in a list.

The process can learn of I/O completion in at least four ways:
1) return codes written into the process' asynchronous I/O control
block, 2) receiving a completely asynchronous "fixed" (queued,
tagged) signal/event which runs a handler, 3) synchronously
suspending for I/O completion (iosuspend), or 4) synchronously
suspending or polling for the signal/event posting I/O completion.
[Suspending is familiar, but the committee added polling, where a
process can sleep until one of a selected signal/event class is
posted while taking signal/events not being polled for completely
asynchronously.]

-- 
A. Lester Buck    buck@siswat.lonestar.org  ...!uhnix1!lobster!siswat!buck
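
For anyone who wants to see roughly what these primitives look like
in code, here is a minimal C sketch.  It assumes the names that the
final POSIX standard settled on (aio_write, aio_suspend, aio_error,
aio_return) rather than the draft names aread/awrite/iosuspend used
in the post.  The helper overlapped_write() and its placeholder
checksum loop are hypothetical, made up purely for illustration.  It
covers only completion methods 1 and 3 from the list: suspend until
the operation is done, then read the status out of the control block.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not from the post: start an asynchronous write
 * of len bytes to path, do unrelated work while the transfer is in
 * flight, then collect the result.
 * Returns the number of bytes written, or -1 on error. */
ssize_t overlapped_write(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    struct aiocb cb;                 /* the asynchronous I/O control block */
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = (void *)data;       /* must stay valid until completion */
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no signal on completion */

    if (aio_write(&cb) != 0) {       /* queue the write and return at once */
        close(fd);
        return -1;
    }

    /* Placeholder computation that overlaps the transfer. */
    unsigned long checksum = 0;
    for (size_t i = 0; i < len; i++)
        checksum += ((const unsigned char *)data)[i];
    (void)checksum;

    /* Method 3: suspend until the operation in the list completes.
     * The loop re-checks the status if aio_suspend() is interrupted. */
    const struct aiocb *const list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    /* Method 1: error status and return value come back through the
     * control block. */
    int err = aio_error(&cb);
    ssize_t n = (err == 0) ? aio_return(&cb) : -1;

    close(fd);
    return n;
}
```

A call such as overlapped_write("/tmp/aio_sketch.dat", "hello", 5)
should return 5 on a system with a working aio implementation.  Older
glibc versions need -lrt at link time.  Signal-driven completion
(method 2) would set sigev_notify to SIGEV_SIGNAL instead; that is
left out to keep the sketch short.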