Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!cs.utexas.edu!uunet!dg!rec
From: rec@dg.dg.com (Robert Cousins)
Newsgroups: comp.arch
Subject: Re: DMA on RISC-based systems
Message-ID: <185@dg.dg.com>
Date: 5 Jun 89 13:31:28 GMT
References: <46500067@uxe.cso.uiuc.edu> <181@dg.dg.com> <1989May31.163057.543@utzoo.uucp> <3480@orca.WV.TEK.COM>
Reply-To: uunet!dg!rec (Robert Cousins)
Organization: Data General, Westboro, MA.
Lines: 103

In article <3480@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner) writes:
>[]

>	"basic requirements ... which must be met to be considered
>	state-of-the art ... [include] dedicated LAN controllers to
>	handle the low levels of the LAN protocol ..."

>If you're going to buy into off-CPU agents to move I/O around, make
>sure that those agents will improve as fast as the CPU, or your future
>generation machines will be crippled.
>
>  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
>                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

I agree.  However, I fear that some people misunderstood my point concerning
"RDA of hardware support."  There are a number of ways to produce brain-damaged
hardware.  For example, Seeq makes an Ethernet controller chip which requires
external DMA support.  If a dumb DMA channel (or no DMA) is used, the
lowest levels of software will end up being exceptionally complex, since
all of the buffer management and scatter/gather will be in software.  There
is also a danger of dropping packets on the floor, which has nasty implications
for performance. :-)    If, however, some slightly more reasonable DMA is 
supplied (similar to the LANCE or Intel chips), the software complexity 
drops substantially.  
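
To give a feel for the bookkeeping that lands in software when the
controller cannot do it, here is a minimal sketch in Python.  The function
names and the 512-byte buffer size are my own illustration, not taken from
any real driver or chip; a real driver would do this against fixed
hardware buffers and under tight timing constraints.

```python
def scatter(frame, buf_size):
    """Split one received frame across fixed-size buffers (software scatter)."""
    if buf_size <= 0:
        raise ValueError("buf_size must be positive")
    return [bytearray(frame[i:i + buf_size])
            for i in range(0, len(frame), buf_size)]


def gather(fragments):
    """Reassemble a frame from its fragments (software gather)."""
    return b"".join(bytes(f) for f in fragments)


# Hypothetical example: a 1514-byte Ethernet frame and 512-byte buffers.
frame = bytes(1514)
buffers = scatter(frame, 512)
print([len(b) for b in buffers])
```

Every byte passes through the CPU at least once here, which is the work a
controller with its own buffer management takes off the driver.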

While I never intended my comments to imply INTELLIGENT control, it is
worthwhile to add it to the discussion.  At DG, our experience is that
it is possible to provide DMA services at prices below competitive non-DMA
products.  Does this mean that the DMA products run faster than the non-DMA
ones?  Not necessarily; often the peripherals are the limiting factor.
However, the following analysis may be enlightening:

Scenario one:  Programmed I/O.

Given that a disk channel will be averaging 200K bytes/second in 
4K byte bursts 20 milliseconds apart using a 1 megabyte/second SCSI
channel, the time required to transfer the data will be SCSI limited 
(given a CPU of > ~3 MIPS).  However, since each byte takes 1 microsecond,
the CPU will be forced to be dedicated to the SCSI channel for 4 milliseconds
each tenure, 50 times per second for a total of 200 milliseconds each
second.  This has cost the user 20% of the available computing power.
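
The arithmetic for scenario one can be checked with a few lines of Python.
This is only a sketch: the function and parameter names are mine, and it
uses 4096 bytes per burst, so it comes out slightly above the rounded 20%
figure.

```python
def pio_cpu_fraction(burst_bytes=4096, bursts_per_sec=50,
                     channel_bytes_per_sec=1_000_000):
    """Fraction of each second the CPU is tied to the SCSI channel."""
    seconds_per_burst = burst_bytes / channel_bytes_per_sec  # about 4 ms
    return seconds_per_burst * bursts_per_sec


print(f"CPU tied up: {pio_cpu_fraction():.1%}")
```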

Scenario two:  Small dedicated buffer.

The buffer is 4K bytes long, so the processor is no longer required
to respond as promptly as above.  The real issue is now the 
copy time, of which there are two components:  transfer time and
context overhead.  The transfer time will be limited by the memory/
cache/CPU bottleneck.  Since the buffer is not cacheable (by implication),
half of the transfer will involve a bus cycle in all cases.  Given a minor
penalty of 4 or 5 instruction periods for this half and assuming a cache
hit on the other side always, the code will look something like this:

		ld	r1,4096			/	byte count
		ld	r2,bufferaddress	/	source: uncached I/O buffer
		ld	r3,destaddress		/	destination: cached memory
	loop:	ldb	r4,(r2)		/	byte load = 4 clocks for miss
		stb	(r3),r4		/	byte store = 1 for cache hit
		addi	r2,1		/	1
		addi	r3,1		/	1
		addi	r1,-1		/	1
		brnz	loop		/	1 (code could be reorg'd)

	Total clocks required: 9 * 4096 = 36864 per block
	* 50 blocks/second = 1843200 clocks/second

Given a CPU speed of 20 MHz, this translates into about 9% of the CPU time.
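
The same count in Python, for anyone who wants to plug in their own
numbers.  The per-instruction clock costs are the ones assumed in the loop
above; the names are mine, and this is a back-of-the-envelope model, not a
measurement.

```python
# Clocks per byte copied, one entry per instruction in the loop body.
LOOP_CLOCKS = {
    "load byte (uncached buffer)": 4,
    "store byte (cache hit)": 1,
    "advance source pointer": 1,
    "advance destination pointer": 1,
    "decrement count": 1,
    "branch": 1,
}


def copy_cpu_fraction(block_bytes=4096, blocks_per_sec=50,
                      cpu_hz=20_000_000, loop_clocks=LOOP_CLOCKS):
    """Fraction of CPU clocks spent copying out of the dedicated buffer."""
    clocks_per_byte = sum(loop_clocks.values())          # 9 in this model
    clocks_per_sec = clocks_per_byte * block_bytes * blocks_per_sec
    return clocks_per_sec / cpu_hz


print(f"CPU spent copying: {copy_cpu_fraction():.1%}")
```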

If the CPU is required to perform the copy during an interrupt service,
there is the danger that lower priority interrupts may be lost.  If the
copy takes place in the top half of the driver, then task latency becomes
an issue.  The buffer will not be drained until after the task wakes up
and completes the copy.  On some Unix implementations, the task wake-up
time can be long enough to affect total throughput.


Scenario three:  Stupid DMA.

Here, the CPU just sets up the DMA and awaits completion.  The overhead
is approximately 0 compared to the above examples.  

Where does the DMA pay off given that all three examples have approximately
identical throughput?  

DMA is preferable to the first choice whenever the cost of DMA is less
than 20% of the cost of the CPU or less than the cost of speeding up
the CPU by 20%.

DMA is preferable to the second choice whenever the cost of DMA is
less than 9% of the cost of the CPU or less than the cost of speeding
up the CPU by 9%.
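
Stated as a rule, the two comparisons above look like this in Python.  The
function name and all of the prices are made up purely for illustration;
only the 20% and 9% figures come from the scenarios.

```python
def dma_pays_off(dma_cost, cpu_cost, cpu_fraction_lost, cpu_speedup_cost):
    """True if DMA costs less than either way of buying back the lost CPU.

    cpu_fraction_lost: share of the CPU consumed without DMA (0.20 or 0.09).
    cpu_speedup_cost:  price of a CPU faster by that same share.
    All costs are in the same (hypothetical) units.
    """
    value_of_lost_cpu = cpu_fraction_lost * cpu_cost
    return dma_cost < value_of_lost_cpu or dma_cost < cpu_speedup_cost


# Made-up prices: DMA at 30 against programmed I/O, then against a buffer.
print(dma_pays_off(30, cpu_cost=400, cpu_fraction_lost=0.20, cpu_speedup_cost=100))
print(dma_pays_off(30, cpu_cost=200, cpu_fraction_lost=0.09, cpu_speedup_cost=25))
```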

I am the first to admit that these models are simplistic, but they
do represent valid considerations and reasonable approximations to 
the actual solutions.

Comments?

Robert Cousins
Dept. Mgr, Workstation Dev't
Data General Corp.

Speaking for myself alone.