Xref: utzoo comp.unix.internals:1607 comp.unix.sysv386:3282
Path: utzoo!attcan!telly!lethe!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!usc!samsung!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.unix.internals,comp.unix.sysv386
Subject: Re: The performance implications of the ISA bus
Summary: Pseudo-DMA described, with short term scheduling.
Message-ID:
Date: 19 Dec 90 14:56:30 GMT
References: <18871@yunexus.YorkU.CA> <1990Dec11.225839.13167@ico.isc.com>
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 160
Nntp-Posting-Host: odin
In-reply-to: dougp@ico.isc.com's message of 11 Dec 90 22:58:39 GMT

On 11 Dec 90 22:58:39 GMT, dougp@ico.isc.com (Doug Pintar) said:

dougp> First, the use of two ESDI controllers will swamp the system
dougp> before giving you much advantage. Remember, standard AT
dougp> controllers interrupt the system once per SECTOR. The interrupt
dougp> code must then push or pull 256 16-bit words to/from the
dougp> controller.

This need not be a big problem. I have had e-mail discussion of these
issues in the last few days, and I take advantage of your posting to
dispel some myths publicly.

The interrupt latency and sector transfer times are quite small. They,
combined, amount to two or three hundred microseconds at most (100 usec
interrupt latency plus time to transfer 512 bytes at 5MB/sec which is
another 100 usec) depending on CPU speed and kernel design.

The *real* problem is that most (all, I think) 386 UNIX disc (and tape!)
drivers are poorly written, as they do not use pseudo-DMA, a standard
technique of PDP/VAX drivers (it is even mentioned in the 4.3BSD Leffler
book). This is described a bit later in this article.

dougp> Given an ESDI raw transfer rate of 800 KB/sec (not unreasonable
dougp> for large blocks) that's 1600 interrupts per second, each with a
dougp> (not real fast, due to bus delays) 256-word PIO transfer.
dougp> Try getting two of those going at once and the system drags down
dougp> REAL fast.

A *sustained* transfer rate of 800KB/sec., that is nearly 100% of peak
transfer rate, is extremely rare. If you are pounding really hard on the
disc you may get from each disk 300KB thru the filesystem in any given
second. This translates to 600 sectors per second; you can do a sector
in 200-300 microseconds, or say 4 sectors per millisecond, so we have an
overhead of 150 milliseconds in every second. 15% is high, but not
tragic.

dougp> I've tried it on a 20 MHz 386 and found at most a 50% improvement
dougp> in aggregate throughput using 2 ESDI controllers simultaneously.
dougp> At that point, you've got 100% of the CPU dedicated to doing I/O
dougp> and none to user code...

This is mostly because the driver is written so that each IO transaction
involves only one sector. Therefore for every sector the top half of the
driver starts the transaction, then sleeps; the bottom half gets
activated by the interrupt and wakes up the top half.

The sleep/wakeup between the top and bottom halves involves, on a busy
system, two context switches, which is already bad, and, most
importantly, calls the scheduler. There is a paper that shows that under
many UNIX ports the cost of a wakeup/sleep is not really that of the
context switches, but of the scheduler calls to decide who is going to
run next, as this takes 90% of the time of a process activation.

With pseudo-DMA the top and bottom halves of the disk driver communicate
via a queue; the top half inserts as many IO operations as it has in the
queue, marking those for whose completion it wants to be notified.

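
As an illustration only (it is not taken from any actual driver), the
queue between the two halves might be sketched in C as below. All the
names (io_op, io_queue, top_half_enqueue, bottom_half_interrupt), the
queue size, and the counters that stand in for programming the
controller and for wakeup() are assumptions made for the sketch. The
interrupt side restarts the controller before issuing any wakeup, and
only wakes a sleeper for operations marked for notification.

```c
#include <assert.h>
#include <stddef.h>

#define QLEN 16                     /* assumed queue capacity */

struct io_op {
    unsigned sector;                /* sector to transfer */
    int notify;                     /* nonzero: wake the top half when done */
};

struct io_queue {
    struct io_op ops[QLEN];         /* ring buffer of pending operations */
    size_t head, count;             /* oldest pending op, number pending */
    struct io_op current;           /* operation the controller is working on */
    int busy;                       /* nonzero while an operation is in flight */
    unsigned started;               /* stand-in: ops handed to the controller */
    unsigned wakeups;               /* stand-in: wakeup() calls issued */
};

/* Hand the oldest pending operation to the controller, if it is idle. */
static void start_next(struct io_queue *q)
{
    if (q->busy || q->count == 0)
        return;
    q->current = q->ops[q->head];
    q->head = (q->head + 1) % QLEN;
    q->count--;
    q->busy = 1;
    q->started++;                   /* a real driver programs the controller here */
}

/* Top half: queue one operation; never sleeps per sector.
 * A real driver would block interrupts around these updates. */
void top_half_enqueue(struct io_queue *q, unsigned sector, int notify)
{
    assert(q->count < QLEN);
    q->ops[(q->head + q->count) % QLEN] = (struct io_op){ sector, notify };
    q->count++;
    start_next(q);                  /* kicks the controller only if it was idle */
}

/* Bottom half: called once per completion interrupt. */
void bottom_half_interrupt(struct io_queue *q)
{
    struct io_op done;

    assert(q->busy);
    done = q->current;              /* sector data would be moved here */
    q->busy = 0;
    start_next(q);                  /* restart the controller first */
    if (done.notify)
        q->wakeups++;               /* a real driver calls wakeup() here */
}
```

Queueing four sectors with only the last one marked for notification
gives four controller starts and four interrupts, but a single wakeup
(and so a single reschedule) for the whole transaction.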
The bottom half will start the first operation in the queue, and then
when it gets the interrupt that signals it is complete, it will
immediately start the next and then, if the just completed operation was
marked for notify, it will wake up the relevant top half (note that
there can be as many instances of the top half active as there are
processes with IO transactions outstanding, while there will be as many
instances of the bottom half as there are CPUs).

This mode of operation means that the bottom half can issue IO
operations as fast as the controller will take them, synchronously with
each interrupt; that each IO operation will have a small overhead
consisting of just the interrupt latency and sector transfer times; and
that the wakeup/sleep and reschedules will only be needed,
asynchronously, once for every IO transaction, which can well involve
many IO operations.

This is simulating an intelligent controller in the driver's bottom
half. A typical IO transaction will consist of an (implied) seek command
and a list of 4-8 sectors, usually contiguous, to be transferred. A
block read via the buffer cache will typically cause two IO
transactions, one for the sectors making up the current block, one for
the read ahead block.

One can also do tricks in the scheduler to reduce the cost of a
reschedule. UNIX implementations are usually badly designed in this, but
one could use a technique used for MUSS (SwPract&Exp, Aug 1979).

The idea is to have a short term scheduler and a long term scheduler,
where UNIX normally has only a long term scheduler. The short term
scheduler manages, in a deterministic way, e.g. priority based or FIFO,
a fixed number of processes; the long term scheduler selects,
periodically, which processes are in the short term scheduler set.

The real cost of scheduling is the policy decision of which processes
are eligible for scheduling. Normally this need only be changed fairly
rarely, and periodically, not on every context change.

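
A rough C sketch of this split, under assumptions of my own: the names
(st_sched, lt_install, st_set_ready, st_pick), the set size of 16, and
the convention that slot 0 has the highest priority are all invented
for illustration and are not taken from MUSS or from any UNIX kernel.
The expensive policy decision is left to whoever calls lt_install();
the short term pick is just a scan of a small fixed table.

```c
#include <assert.h>

#define ST_SLOTS 16                 /* assumed size of the short term set */

struct st_slot {
    int pid;                        /* process id, or -1 if the slot is empty */
    int ready;                      /* nonzero if the process is ready to run */
};

struct st_sched {
    struct st_slot slot[ST_SLOTS];  /* index 0 = highest priority (assumed) */
};

/* Long term scheduler: called rarely and periodically. The caller has
 * already made the policy decision and passes the chosen processes in
 * priority order; they start out not ready in this sketch. */
void lt_install(struct st_sched *s, const int *pids, int n)
{
    assert(n >= 0 && n <= ST_SLOTS);
    for (int i = 0; i < ST_SLOTS; i++) {
        s->slot[i].pid = (i < n) ? pids[i] : -1;
        s->slot[i].ready = 0;
    }
}

/* Mark a member of the short term set ready or blocked. */
void st_set_ready(struct st_sched *s, int pid, int ready)
{
    for (int i = 0; i < ST_SLOTS; i++)
        if (s->slot[i].pid == pid)
            s->slot[i].ready = ready;
}

/* Short term scheduler: called on every process switch. It makes no
 * policy decision, it only finds the first ready process in the set.
 * Returns -1 when nothing is ready (idle). */
int st_pick(const struct st_sched *s)
{
    for (int i = 0; i < ST_SLOTS; i++)
        if (s->slot[i].pid >= 0 && s->slot[i].ready)
            return s->slot[i].pid;
    return -1;
}
```

The point of the design is that st_pick() costs a bounded handful of
comparisons however many processes exist in the system, while the work
that depends on the total process count happens only in the periodic
long term pass.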
Having a short term scheduler means that the cost of a process switch is
only marginally higher than that of a context switch, because the short
term scheduler's job is just to find the first ready-to-run process in a
fixed size list of maybe 16 entries.

A nice extra idea found in MUSS was to make the short term scheduler use
bitmap queues for strictly priority based scheduling; queues are words,
and each bit in a word represents a different process, and a different
priority. To add a process to a queue (e.g. the ready to run queue) one
just turns on its bit, and so on.

Ah, if only UNIX designers and implementors had one tenth of the insight
of the MUSS ones!

dougp> Two drives on a single AT-compatible controller will gain you
dougp> something in latency-reduction, as the HPDD does some cute tricks
dougp> to overlap seeks.

For a multiuser system, which is the scope of my posting, this is far
more important than bandwidth. Multiuser systems are seek-limited more
than bandwidth-limited (for small timesharing multiuser systems, that
is).

dougp> Bus-mastering DMA SCSI adapters, like the Adaptec 154x (ISA) or
dougp> 1640 (MCA) provide MUCH better throughput. They ARE
dougp> multi-threaded, and the HPDD will try to keep commands
dougp> outstanding on each drive it can use. The major win is that the
dougp> entire transfer is controlled by the adapter, with host
dougp> intervention only when a transfer is complete. You get lots more
dougp> USER cycles this way!

Yes, this is true in general. But there are twists to this argument. In
the pseudo-DMA technique described above, a multithreaded, hw DMA and
scatter gather controller is simulated by "lending" the main CPU to a
dumb controller; the bottom half of the disk driver becomes the
microcode of this "pseudo intelligent controller" and simulates the DMA
and the scatter gather.

The main CPU is usually *much* faster than the one that is put in
actual intelligent controllers (say 386 vs.
8086), so IO rates _might_ be higher with a pseudo intelligent
controller than a real one.

On the other hand the real intelligent controller can work in parallel
with the main CPU. In IO bound systems this is of course of little or no
benefit (because there are CPU cycles to spare), unless there are
multiple intelligent controllers, which is rare.

dougp> I'm still not convinced that cacheing controllers are a big win
dougp> over a large Unix buffer cache. I usually use 1-2 MB of cache,

Ah yes! Devoting to the cache 25% of available memory seems to be a good
rule of thumb.

dougp> and a couple-MB RAMdisk for /tmp if I have the memory available.

But /tmp should not be on a RAM disk; it should be in a normal
filesystem, even if it almost never actually causes IO transactions, as
short lived files under /tmp should exist only in the cache.

Unfortunately the "hardening" features of the System V filesystem mean
that even short lived files will be sync'ed out (at least the inodes),
but this can be partially obviated by tweaking tunable parameters, for
example by substantially enlarging the inode cache (almost as important
as the block cache) and slowing down bdflush.

Overall, instead of having a RAM disk for /tmp, I would devote the core
that would go to it to enlarging the buffer and inode caches.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk