Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!oliveb!apple!amdahl!amdcad!rpw3
From: rpw3@amdcad.AMD.COM (Rob Warnock)
Newsgroups: comp.dcom.modems
Subject: Re: Telebit transfer rate problem
Message-ID: <25049@amdcad.AMD.COM>
Date: 31 Mar 89 05:18:54 GMT
References: <640@island.uu.net> <64160@pyramid.pyramid.com> <1553@neoucom.UUCP> <1330@auspex.UUCP>
Reply-To: rpw3@amdcad.UUCP (Rob Warnock)
Organization: [Consultant] San Mateo, CA
Lines: 106

In article <1330@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
+---------------
| >...Most of the tty drivers generate a CPU interrupt per character received.
| That's not a software engineer's problem, that's a hardware engineer's
| problem - the serial port hardware doesn't buffer up characters.
+---------------

Well, it's also a Unix problem. Unix was originally written when terminals
were *slooow*. (Can you still spell ASR-33?) Thus it never bothered anybody
that the kernel cut all interrupts off for many milliseconds at a time.
(The most serious offenders tend to be in the disk buffer cache search and
in the time-of-day-crosses-a-second/minute/hour/day code.) So with higher
speed lines, regardless of the *efficiency* (see below) of the TTY driver
implementation, the rest of the kernel simply doesn't accommodate the low
*latency* requirement of these speeds.

+---------------
| >With Unix this is nasty, as it might mean that several context switches
| >take place for each character received.
| That has nothing to do with the number of interrupts;
| The problem is that you get a *wakeup* for every character received;
| that's where the context switches come from. There are at least two
| ways around this:
+---------------

...and Guy describes what I call "pseudo-DMA with dallying", and VMIN/VTIME.
These are both *efficiency* optimizations, and while quite worth it in terms
of efficiency [esp.
pseudo-DMA + dallying], they can't help high-speed character input if the
rest of the kernel breaks the *latency* requirement. [To put some numbers
on it, since most serial ports these days have 3, maybe as few as 2, bytes
of buffering, you can tolerate the TTY interrupts being shut off for at
*most* 3 character times, and 1 char time is safer. At 19200 baud async
(about 1920 chars/sec, assuming the usual 10 bits per character), 1 char
time is about 1/2 millisecond. But I have seen "production" Unix kernels
which held "spl_high()" for tens of milliseconds!]

The solution is to fix the latency breakers, *then* apply the mentioned
efficiency changes. I recently described at length in comp.arch a
straightforward way to do that (known to many kernel hackers, but by no
means all); for those who don't read that group, here is a condensed
version:

You split interrupt service into a "first-level"/hardware-oriented/
assembly-language part, and a "second-level"/software-oriented/C-language
part. You leave the "real" hardware interrupts always enabled (especially
during 2nd-level handlers, system calls, etc.). When an interrupt occurs,
all you do is clear the interrupting hardware, grab whatever really
volatile data there might be [e.g., a just-received async character], and
queue up a task block naming the 2nd-level handler to run -- if it's even
needed ("soft"-DMA can often just stash the data in a buffer and dismiss).
The Unix "splXXX()" [Set Priority Level] routines are modified to
manipulate a *software* notion of priority, which is respected by the
2nd-level routines and system-call level code (but not the hardware), but
they never turn off the *hardware* enables.

Benefits:

1. The hardware interrupts are disabled only for the brief moment when a
   1st-level handler is running. [You will be amazed how good your CPU's
   interrupt response time *really* is -- especially if it's one of the
   new RISCs. Even older CISCs can handle astounding numbers of interrupts
   per second.
   For example, a certain PDP-8-based terminal front-end handled 10,000
   chars/sec *through* the node, interrupt per char. 68000's do better.
   29000's do *lots* better.]

2. The 1st-level tasks can usually be done in a few assembly instructions
   without saving very much CPU state; the 2nd-level tasks need a full C
   context, reentrant and "interruptible" -- a lot more state. Since
   interrupts are often "bursty", the two-level structure saves state
   *once* for several interrupts, a significant efficiency gain. In fact,
   interrupt handling gets more efficient the higher the interrupt rate.

3. Most interrupts from "character" devices can be handled entirely in the
   1st-level handlers as "soft-DMA", or "pseudo-DMA", thus lessening
   further the number of full CPU state saves done. [This is the main
   benefit of Guy's first point.]

Applying the above to a Version 7 Unix port to a 5.5 MHz 68000 (years ago),
we were able to take a system which could hardly do a single 2400-baud UUCP
and get it to cheerfully handle three simultaneous 9600-baud UUCPs! ...and
with no change to the hardware: interrupt-per-character SIO chips.

[Sadly, I must admit that the reason that same system could never do even
*one* 19200-baud UUCP is that after we had achieved such a speedup,
management wouldn't let us spend the time to find out where the remaining
latency-breaker for 19200 was... somewhere in the once-a-second clock
stuff, we thought. Thus my Telebit is locked at 9600, not 19200. (*sigh*)]

To hammer the point home, there are three conflicting goals in doing
"real-time" work [and yes, Unix I/O *is* "real-time"!]: latency,
efficiency, and throughput. UNLESS YOU ARE VERY CAREFUL and explicitly pay
attention to "balance", efforts to improve one often have adverse effects
on the others.


Rob Warnock
Systems Architecture Consultant

UUCP:     {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:      (415)572-2607
USPS:     627 26th Ave, San Mateo, CA 94403
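
Illustration (not from the original system): a minimal user-space C sketch
of the two-level scheme described in the post, under stated assumptions.
Every name in it (first_level_rx, soft_spl, run_pending, SPL_TTY, the ring
buffer) is hypothetical and chosen only for this example. In a real kernel
the 1st level would be a few assembly instructions entered from the
hardware vector, and the dispatch check would happen on return from
interrupt; here both are plain function calls so the control flow can be
followed and tested.

```c
/* Hypothetical simulation of two-level interrupt service with a
 * software-only priority level. Not kernel code. */

#define RING_SIZE 64            /* soft-DMA buffer size (power of two)     */
#define SPL_NONE  0             /* software level: nothing held off        */
#define SPL_TTY   5             /* software level: tty 2nd level held off  */

static unsigned char ring[RING_SIZE];
static unsigned head, tail;          /* head: 1st level writes; tail: 2nd level reads */
static int soft_level = SPL_NONE;    /* the *software* notion of priority   */
static int tty_task_queued;          /* "task block" naming the 2nd level   */
static unsigned second_level_runs;   /* times the expensive path was taken  */
static unsigned chars_delivered;     /* chars handed on by the 2nd level    */

/* 2nd level: full C context; drains everything buffered so far. */
static void tty_second_level(void)
{
    second_level_runs++;
    while (tail != head) {
        (void)ring[tail % RING_SIZE];   /* would go to the line discipline */
        tail++;
        chars_delivered++;
    }
}

/* Run queued 2nd-level work, but only if the software level allows it. */
static void run_pending(void)
{
    while (tty_task_queued && soft_level < SPL_TTY) {
        int old = soft_level;
        tty_task_queued = 0;
        soft_level = SPL_TTY;           /* handler runs at its own level   */
        tty_second_level();
        soft_level = old;
    }
}

/* Software spl: changes only the software level, never a hardware enable.
 * Returns the previous level; lowering may release deferred work. */
int soft_spl(int new_level)
{
    int old = soft_level;
    soft_level = new_level;
    run_pending();
    return old;
}

/* 1st level: stash the volatile data, queue the task, dismiss.
 * Returns 0 on overrun (buffer full), 1 otherwise. */
int first_level_rx(unsigned char c)
{
    if (head - tail == RING_SIZE)
        return 0;
    ring[head % RING_SIZE] = c;
    head++;
    tty_task_queued = 1;
    run_pending();                      /* stands in for return-from-interrupt */
    return 1;
}
```

In this sketch, characters that arrive while the software level is held at
SPL_TTY are still captured by first_level_rx(), and the single call to
soft_spl() that lowers the level drains the whole burst in one 2nd-level
pass. That is the "saves state once for several interrupts" effect from
benefit 2, reduced to a counter.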