Xref: utzoo comp.unix.internals:1607 comp.unix.sysv386:3282
Path: utzoo!attcan!telly!lethe!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!usc!samsung!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.unix.internals,comp.unix.sysv386
Subject: Re: The performance implications of the ISA bus
Summary: Pseudo-DMA described, with short term scheduling.
Message-ID: <PCG.90Dec19145630@odin.cs.aber.ac.uk>
Date: 19 Dec 90 14:56:30 GMT
References: <O7Y77DB@xds13.ferranti.com> <PST.90Dec1131440@ack.Stanford.EDU>
	<PCG.90Dec10182430@odin.cs.aber.ac.uk> <18871@yunexus.YorkU.CA>
	<1990Dec11.225839.13167@ico.isc.com>
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 160
Nntp-Posting-Host: odin
In-reply-to: dougp@ico.isc.com's message of 11 Dec 90 22:58:39 GMT

On 11 Dec 90 22:58:39 GMT, dougp@ico.isc.com (Doug Pintar) said:

dougp> First, the use of two ESDI controllers will swamp the system
dougp> before giving you much advantage.  Remember, standard AT
dougp> controllers interrupt the system once per SECTOR.  The interrupt
dougp> code must then push or pull 256 16-bit words to/from the
dougp> controller.

This need not be a big problem. I have had e-mail discussion of these
issues in the last few days, and I take advantage of your posting to
dispel some myths publicly.

The interrupt latency and sector transfer times are quite small.
Combined, they amount to two or three hundred microseconds at most (100
usec of interrupt latency, plus about another 100 usec to transfer 512
bytes at 5MB/sec), depending on CPU speed and kernel design.

The *real* problem is that most (all, I think) 386 UNIX disc (and tape!)
drivers are poorly written, as they do not use pseudo-DMA, a standard
technique of PDP/VAX drivers (it is even mentioned in the 4.3BSD Leffler
book). This is described a bit later in this article.

dougp> Given an ESDI raw transfer rate of 800 KB/sec (not unreasonable
dougp> for large blocks) that's 1600 interrupts per second, each with a
dougp> (not real fast, due to bus delays) 256-word PIO transfer.  Try
dougp> getting two of those going at once and the system drags down REAL
dougp> fast.

A *sustained* transfer rate of 800KB/sec, that is nearly 100% of the
peak transfer rate, is extremely rare. If you are pounding really hard
on the discs you may get 300KB from each disc through the filesystem in
any given second. This translates to 600 sectors per second; you can do
a sector in 200-300 microseconds, or say 4 sectors per millisecond, so
we have an overhead of 150 milliseconds in every second. 15% is high,
but not tragic.
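
The arithmetic above can be checked with a few lines of C. This is only
a sketch: the name pio_overhead_fraction and its parameters are made up
for illustration, 1KB is assumed to be 1024 bytes, and the per-sector
cost is taken as a single flat figure.

```c
#include <assert.h>

/*
 * Fraction of each second the CPU spends servicing per-sector interrupts.
 * kb_per_sec:      sustained throughput in KB/sec (1KB taken as 1024 bytes)
 * usec_per_sector: interrupt latency plus PIO transfer time for one sector
 * Hypothetical helper, for illustration only.
 */
double pio_overhead_fraction(double kb_per_sec, double usec_per_sector)
{
    const double sector_bytes = 512.0;
    double sectors_per_sec;

    assert(kb_per_sec >= 0.0 && usec_per_sector >= 0.0);
    sectors_per_sec = kb_per_sec * 1024.0 / sector_bytes;
    return sectors_per_sec * usec_per_sector / 1e6;
}
```

With 300KB/sec and 250 usec per sector this gives 0.15, the 15% quoted
above. At the 800KB/sec that Doug mentions, the same per-sector cost
works out to 1600 sectors per second and 40% of the CPU.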

dougp> I've tried it on a 20 MHz 386 and found at most a 50% improvement
dougp> in aggregate throughput using 2 ESDI controllers simultaneously.
dougp> At that point, you've got 100% of the CPU dedicated to doing I/O
dougp> and none to user code...

This is mostly because the driver is written so that each IO transaction
involves only one sector. Therefore, for every sector, the top half of
the driver starts the transaction and then sleeps; the bottom half gets
activated by the interrupt and wakes up the top half.

The sleep/wakeup between the top and bottom halves involves, on a busy
system, two context switches, which is already bad, and, most
importantly, calls the scheduler. There is a paper that shows that under
many UNIX ports the cost of a wakeup/sleep is not really that of the
context switches, but of the scheduler calls to decide who is going to
run next, as this takes 90% of the time of a process activation.

With pseudo-DMA the top and bottom halves of the disk driver communicate
via a queue. The top half inserts as many IO operations as it has into
the queue, marking those for whose completion it wants to be notified.
The bottom half starts the first operation in the queue; when it gets
the interrupt that signals completion, it immediately starts the next
operation and then, if the just completed operation was marked for
notify, wakes up the relevant top half. (Note that there can be as many
instances of the top half active as there are processes with IO
transactions outstanding, while there will be as many instances of the
bottom half as there are CPUs.)

This mode of operation means that the bottom half can issue IO
operations as fast as the controller will take them, synchronously with
each interrupt; that each IO operation will have a small overhead,
consisting of just the interrupt latency and sector transfer times; and
that the wakeup/sleep and reschedules will only be needed,
asynchronously, once per IO transaction, which can well involve many IO
operations. This is simulating an intelligent controller in the driver's
bottom half.

A typical IO transaction will consist of an (implied) seek command and a
list of 4-8 sectors, usually contiguous, to be transferred. A block read
via the buffer cache will typically cause two IO transactions, one for
the sectors making up the current block, one for the read ahead block.
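
The queue discipline described above can be sketched in user-space C.
Everything here is hypothetical (struct io_op, struct io_queue,
io_enqueue, io_interrupt): programming the controller and calling
wakeup() are replaced by counters, and a real driver would also have to
block interrupts while the top half manipulates the queue. It is meant
only to show that a transaction of several sectors costs one wakeup.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_MAX 32

/* One IO operation: a single sector transfer (hypothetical layout). */
struct io_op {
    unsigned long sector;
    int notify;             /* nonzero: wake the top half on completion */
};

struct io_queue {
    struct io_op ops[QUEUE_MAX];    /* ring buffer of pending operations */
    size_t head, count;
    int busy;                       /* controller is working on ops[head] */
    unsigned long started;          /* operations handed to the controller */
    unsigned long wakeups;          /* wakeups issued to top halves */
};

/* Stand-in for programming the controller with the op at the queue head. */
static void start_head(struct io_queue *q)
{
    q->busy = 1;
    q->started++;
}

/* Top half: queue one operation; kick the controller only if it is idle. */
int io_enqueue(struct io_queue *q, unsigned long sector, int notify)
{
    size_t tail;

    if (q->count == QUEUE_MAX)
        return -1;                  /* queue full, caller must wait */
    tail = (q->head + q->count) % QUEUE_MAX;
    q->ops[tail].sector = sector;
    q->ops[tail].notify = notify;
    q->count++;
    if (!q->busy)
        start_head(q);
    return 0;
}

/* Bottom half: called once per completion interrupt. */
void io_interrupt(struct io_queue *q)
{
    struct io_op done;

    assert(q->busy && q->count > 0);
    done = q->ops[q->head];
    q->head = (q->head + 1) % QUEUE_MAX;
    q->count--;
    q->busy = 0;
    if (q->count > 0)
        start_head(q);              /* start the next op immediately */
    if (done.notify)
        q->wakeups++;               /* stand-in for wakeup() */
}
```

Queueing eight sectors with only the last one marked for notify, and
then delivering eight interrupts, starts eight operations but issues a
single wakeup, instead of the eight sleep/wakeup pairs that a
one-sector-per-transaction driver would need.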

One can also do tricks in the scheduler to reduce the cost of a
reschedule. UNIX implementations are usually badly designed in this
respect, but one could use a technique used for MUSS (SwPract&Exp, Aug
1979).

The idea is to have a short term scheduler and a long term scheduler,
where UNIX normally has only a long term scheduler.  The short term
scheduler manages, in a deterministic way, e.g.  priority based or FIFO,
a fixed number of processes; the long term scheduler selects,
periodically, which processes are in the short term scheduler set. The
real cost of scheduling is the policy decision of which processes are
eligible for scheduling. Normally this need only be changed fairly
rarely, and periodically, not on every context change.

Having a short term scheduler means that the cost of a process switch is
only marginally higher than that of a context switch, because the short
term scheduler's job is just to find the first ready-to-run process in a
fixed size list of maybe 16 entries.
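
As a rough illustration of how cheap that lookup is, here is a sketch
of a short term scheduler pick over a fixed set of 16 slots. The names
(struct short_term_set, short_term_pick) are invented for this example,
and the assumption that slot 0 has the highest priority is mine; the
long term scheduler, which would fill and reorder the slots
periodically, is not shown.

```c
#include <assert.h>
#include <stddef.h>

#define SHORT_TERM_SLOTS 16     /* "maybe 16 entries" */

enum slot_state { SLOT_EMPTY, SLOT_BLOCKED, SLOT_READY };

/* The set chosen by the long term scheduler (hypothetical layout). */
struct short_term_set {
    int pid[SHORT_TERM_SLOTS];
    enum slot_state state[SHORT_TERM_SLOTS];
};

/*
 * Short term scheduler: return the index of the first ready slot, or -1
 * if nothing is ready to run. Slots are assumed to be kept in priority
 * order, slot 0 first, so no policy decision is made here.
 */
int short_term_pick(const struct short_term_set *set)
{
    int i;

    assert(set != NULL);
    for (i = 0; i < SHORT_TERM_SLOTS; i++)
        if (set->state[i] == SLOT_READY)
            return i;
    return -1;
}
```

The loop is bounded by a small constant, so the cost added to each
context switch does not grow with the number of processes in the
system.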

A nice extra idea found in MUSS was to make the short term scheduler use
bitmap queues for strictly priority based scheduling; queues are words,
and each bit in a word represents a different process, and a different
priority. To add a process to a queue (e.g. the ready to run queue) one
just turns on its bit, and so on.
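
A bitmap queue of this kind might look as follows in C. This is a
sketch under assumptions, not MUSS code: the names (procset_t,
procset_add, procset_remove, procset_first) are hypothetical, the word
is 16 bits wide, and bit 0 is taken to be the highest priority.

```c
#include <assert.h>
#include <stdint.h>

#define PROCSET_BITS 16

/* One queue is one word; each bit is one process and one priority. */
typedef uint16_t procset_t;

/* Add the process in the given slot to the queue: turn its bit on. */
procset_t procset_add(procset_t q, unsigned slot)
{
    assert(slot < PROCSET_BITS);
    return (procset_t)(q | (1u << slot));
}

/* Remove the process in the given slot from the queue: turn its bit off. */
procset_t procset_remove(procset_t q, unsigned slot)
{
    assert(slot < PROCSET_BITS);
    return (procset_t)(q & ~(1u << slot));
}

/*
 * Return the highest priority slot in the queue (lowest set bit, by the
 * assumption above), or -1 if the queue is empty.
 */
int procset_first(procset_t q)
{
    int slot;

    for (slot = 0; slot < PROCSET_BITS; slot++)
        if (q & (1u << slot))
            return slot;
    return -1;
}
```

Adding slots 5 and 2 to an empty queue and asking for the first entry
yields slot 2. On a CPU with a bit scan instruction the loop in
procset_first can be replaced by that single instruction.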

Ah, if only UNIX designers and implementors had one tenth of the insight
of the MUSS ones!

dougp> Two drives on a single AT-compatible controller will gain you
dougp> something in latency-reduction, as the HPDD does some cute tricks
dougp> to overlap seeks.

For a multiuser system, which is the scope of my posting, this is far
more important than bandwidth. Multiuser systems are seek-limited more
than bandwidth-limited (for small timesharing multiuser systems, that
is).

dougp> Bus-mastering DMA SCSI adapters, like the Adaptec 154x (ISA) or
dougp> 1640 (MCA) provide MUCH better throughput.  They ARE
dougp> multi-threaded, and the HPDD will try to keep commands
dougp> outstanding on each drive it can use.  The major win is that the
dougp> entire transfer is controlled by the adapter, with host
dougp> intervention only when a transfer is complete.  You get lots more
dougp> USER cycles this way!

Yes, this is true in general. But there are twists to this argument. In
the pseudo-DMA technique described above, a multithreaded, hw DMA and
scatter gather controller is simulated by "lending" the main CPU to a
dumb controller; the bottom half of the disk driver becomes the
microcode of this "pseudo intelligent controller" and simulates the DMA
and the scatter gather.

The main CPU is usually *much* faster than the one that is put in real
intelligent controllers (say 386 vs. 8086), so IO rates _might_ be
higher with a pseudo intelligent controller than with a real one. On the
other hand the real intelligent controller can work in parallel with the
main CPU. In IO bound systems this is of course of little or no benefit
(because there are CPU cycles to spare), unless there are multiple
intelligent controllers, which is rare.

dougp> I'm still not convinced that cacheing controllers are a big win
dougp> over a large Unix buffer cache.  I usually use 1-2 MB of cache,

Ah yes! Devoting 25% of available memory to the cache seems to be a good
rule of thumb.

dougp> and a couple-MB RAMdisk for /tmp if I have the memory available.

But /tmp should not be on a RAM disk; it should be in a normal
filesystem, which will almost never actually cause IO transactions, as
short lived files under /tmp should exist only in the cache.

Unfortunately the "hardening" features of the System V filesystem mean
that even short lived files will be sync'ed out (at least the inodes),
but this can be partially obviated by tweaking tunable parameters, for
example by substantially enlarging the inode cache (almost as important
as the block cache) and slowing down bdflush. Overall, instead of having
a RAM disk for /tmp, I would devote the core that would go to it to
enlarging the buffer and inode caches.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk