Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!oliveb!apple!amdahl!amdcad!rpw3
From: rpw3@amdcad.AMD.COM (Rob Warnock)
Newsgroups: comp.dcom.modems
Subject: Re: Telebit transfer rate problem
Message-ID: <25049@amdcad.AMD.COM>
Date: 31 Mar 89 05:18:54 GMT
References: <640@island.uu.net> <64160@pyramid.pyramid.com> <1553@neoucom.UUCP> <1330@auspex.UUCP>
Reply-To: rpw3@amdcad.UUCP (Rob Warnock)
Organization: [Consultant] San Mateo, CA
Lines: 106

In article <1330@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
+---------------
| >...Most of the tty drivers generate a CPU interrupt per character received.
| That's not a software engineer's problem, that's a hardware engineer's
| problem - the serial port hardware doesn't buffer up characters.
+---------------

Well, it's also a Unix problem. Unix was originally written when terminals
were *slooow*. (Can you still spell ASR-33?) Thus it never bothered anybody
that the kernel cut all interrupts off for many milliseconds at a time.
(The most serious offenders tend to be in the disk buffer cache search
and in the time-of-day-crosses-a-second/minute/hour/day code.)

Thus with higher speed lines, regardless of the *efficiency* (see below)
of the TTY driver implementation, the rest of the kernel simply doesn't
accommodate the low *latency* requirement of these speeds.

+---------------
| >With Unix this is nasty, as it might mean that several context switches
| >take place for each character received.
| That has nothing to do with the number of interrupts;
| The problem is that you get a *wakeup* for every character received;
| that's where the context switches come from.  There are at least two
| ways around this:
+---------------

...and Guy describes what I call "pseudo-DMA with dallying", and VMIN/VTIME.

These are both *efficiency* optimizations, and while quite worth
it in terms of efficiency [esp. pseudo-DMA + dallying], they can't
help high-speed character input if the rest of the kernel breaks the
*latency* requirement.

[To put some numbers on it, since most serial ports these days have
3, maybe as little as 2, bytes of buffering, you can tolerate the TTY
interrupts being shut off for at *most* 3 character times, and 1 char
time is safer. At 19200 baud async, 1 char time is about 1/2 millisecond.
But I have seen "production" Unix kernels which held "spl_high()" for
tens of milliseconds!]
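
To make that arithmetic explicit, here is a minimal C sketch. It assumes
the usual 10 bits per character on the wire (1 start, 8 data, 1 stop);
the framing is my assumption, and the names char_time_us() and
max_blackout_us() are invented for illustration only.

```c
#include <assert.h>

/* Assumed framing: 1 start + 8 data + 1 stop bit per character. */
#define BITS_PER_CHAR 10

/* Time on the wire for one async character, in microseconds. */
static double char_time_us(long baud)
{
    assert(baud > 0);
    return BITS_PER_CHAR * 1e6 / (double)baud;
}

/* Longest time receive interrupts may stay shut off before a port
 * with `fifo_bytes` of buffering overruns, in microseconds. */
static double max_blackout_us(long baud, int fifo_bytes)
{
    assert(fifo_bytes > 0);
    return fifo_bytes * char_time_us(baud);
}
```

char_time_us(19200) comes out near 521 microseconds, which is the
"about 1/2 millisecond" figure; max_blackout_us(19200, 3) is about
1.56 milliseconds, still far below tens of milliseconds.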

The solution is to fix the latency breakers, *then* apply the mentioned
efficiency changes. A straightforward way to do that (known to many kernel
hackers, but by no means all) I recently described at length in comp.arch,
but for those who don't read that group, a condensed version:

You split interrupt service into a "first-level"/hardware-oriented/
assembly-language part, and a "second-level"/software-oriented/C-language
part. You leave the "real" hardware interrupts always enabled (especially
during 2nd-level handlers, system calls, etc.). When an interrupt occurs,
all you do is clear the interrupting hardware, grab whatever really volatile
data there might be [e.g., a just-received async character], and queue up
a task block naming the 2nd-level handler to run -- if it's even needed
("soft"-DMA can often just stash the data in a buffer and dismiss).
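
A rough sketch of the receive side of that split, in portable C rather
than assembly so it can be read and tested off-target. Everything here
(struct rx_ring, rx_first_level(), rx_second_level(), the buffer size)
is a hypothetical illustration, not code from any real kernel. A real
first-level handler would read the character from the chip and would
need the appropriate volatile/atomic treatment for its CPU.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RX_SIZE 256u            /* power of two, so wrap is a mask */

struct rx_ring {
    uint8_t  data[RX_SIZE];
    unsigned head;              /* advanced only by the 1st level  */
    unsigned tail;              /* advanced only by the 2nd level  */
    unsigned dropped;           /* characters lost to a full ring  */
    bool     drain_pending;     /* 2nd-level task already queued?  */
};

/* First level: stash one just-received character and dismiss.
 * Returns true only when the caller must queue the 2nd-level task;
 * in a burst, most calls return false ("soft"-DMA, nothing queued). */
static bool rx_first_level(struct rx_ring *r, uint8_t ch)
{
    unsigned next = (r->head + 1u) & (RX_SIZE - 1u);

    if (next == r->tail) {      /* ring full: count it and give up */
        r->dropped++;
        return false;
    }
    r->data[r->head] = ch;
    r->head = next;

    if (r->drain_pending)       /* a drain is already on the queue */
        return false;
    r->drain_pending = true;
    return true;
}

/* Second level: runs later, with hardware interrupts enabled, and
 * moves up to `max` buffered characters to `out`. The flag is cleared
 * first so that a character arriving mid-drain queues a new task. */
static size_t rx_second_level(struct rx_ring *r, uint8_t *out, size_t max)
{
    size_t n = 0;

    r->drain_pending = false;
    while (r->tail != r->head && n < max) {
        out[n++] = r->data[r->tail];
        r->tail = (r->tail + 1u) & (RX_SIZE - 1u);
    }
    return n;
}
```

In a burst of three characters only the first call asks for a task to
be queued; the single second-level run then picks up all three.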

The Unix "splXXX()" [Set Priority Level] routines are modified to manipulate
a *software* notion of priority, which is respected by the 2nd-level routines
and system-call level code (but not the hardware), but they never turn off
the *hardware* enables.
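
One way the software priority might be sketched in C. This is an
illustration under my own assumptions, not the code from the port
described below: soft_spl(), soft_post(), soft_dispatch(), the fixed
task table, and the no-op hw_intr_off()/hw_intr_on() hooks are all
hypothetical names. On real hardware the hooks would mask interrupts
for just the few instructions that touch the queue.

```c
#include <stddef.h>

#define NTASKS 16

typedef void (*handler_fn)(void *arg);

struct task { int level; handler_fn fn; void *arg; };

static struct task pending[NTASKS];
static size_t      npending;
static int         soft_level;  /* 0 = nothing blocked */

/* Hypothetical hardware hooks; no-ops so the sketch runs anywhere. */
static void hw_intr_off(void) {}
static void hw_intr_on(void)  {}

/* Called by a 1st-level handler: queue a 2nd-level handler.
 * Returns 1 on success, 0 if the table is full. */
static int soft_post(int level, handler_fn fn, void *arg)
{
    int ok = 0;

    hw_intr_off();
    if (npending < NTASKS) {
        pending[npending++] = (struct task){ level, fn, arg };
        ok = 1;
    }
    hw_intr_on();
    return ok;
}

/* Run, oldest first, every queued task whose level is above the
 * current software level. Each handler runs at its own level. */
static void soft_dispatch(void)
{
    for (;;) {
        struct task t = { 0, NULL, NULL };
        int found = 0;

        hw_intr_off();
        for (size_t i = 0; i < npending; i++) {
            if (pending[i].level > soft_level) {
                t = pending[i];
                for (size_t j = i + 1; j < npending; j++)
                    pending[j - 1] = pending[j];
                npending--;
                found = 1;
                break;
            }
        }
        hw_intr_on();
        if (!found)
            return;

        int saved = soft_level;
        soft_level = t.level;
        t.fn(t.arg);
        soft_level = saved;
    }
}

/* Software-only splXXX(): never touches the hardware enables.
 * Lowering the level runs whatever was held off meanwhile. */
static int soft_spl(int new_level)
{
    int old = soft_level;

    soft_level = new_level;
    if (new_level < old)
        soft_dispatch();
    return old;
}

/* Example 2nd-level handler for demonstration: bumps a counter. */
static void count_handler(void *arg) { ++*(int *)arg; }
```

Code that used to bracket a critical section with a hardware spl now
does "old = soft_spl(5); ... soft_spl(old);". A level-4 task posted in
between stays queued and runs when the level drops, while 1st-level
handlers keep taking characters the whole time.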

Benefits:

1. The hardware interrupts are disabled only for the brief moment when a
   1st-level handler is running.
   
[You will be amazed how good your CPU's interrupt response time *really*
is -- especially if it's one of the new RISCs. Even older CISCs can handle
astounding numbers of interrupts per second. For example, a certain PDP8-based
terminal front-end handled 10,000 chars/sec *through* the node, interrupt
per char. 68000's do better. 29000's do *lots* better.]

2. The 1st-level tasks can usually be done in a few assembly instructions
   without saving very much CPU state; the 2nd-level tasks need a full
   C context, reentrant and "interruptible" -- a lot more state. Since
   interrupts are often "bursty", the two-level structure saves state
   *once* for several interrupts, a significant efficiency gain. In fact,
   interrupt handling gets more efficient the higher the interrupt rate.

3. Most interrupts from "character" devices can be handled entirely in
   the 1st-level handlers as "soft-DMA", or "pseudo-DMA", thus lessening
   further the number of full CPU state saves done. [This is the main
   benefit of Guy's first point.]

Applying the above to a Version 7 Unix port to a 5.5 MHz 68000 (years ago),
we were able to take a system which could hardly do a single 2400-baud UUCP
and get it to cheerfully handle three simultaneous 9600-baud UUCPs! ...and
with no change to the hardware: interrupt-per-character SIO chips.

[Sadly, I must admit that the reason that same system could never do even
*one* 19200-baud UUCP is that after we had achieved such a speedup, management
wouldn't let us spend the time to find out where the remaining latency-breaker
for 19200 was... somewhere in the once-a-second clock stuff, we thought.
Thus my Telebit is locked at 9600, not 19200. (*sigh*)]

To hammer the point home, there are three conflicting goals in doing "real-
time" work [and yes, Unix I/O *is* "real-time"!]: latency, efficiency, and
throughput. UNLESS YOU ARE VERY CAREFUL and explicitly pay attention to
"balance", efforts to improve one often have adverse effects on the others.


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403