Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!rutgers!cs.utexas.edu!uunet!dg!rec
From: rec@dg.dg.com (Robert Cousins)
Newsgroups: comp.arch
Subject: Re: DMA on RISC-based systems
Summary: People of good will can disagree
Message-ID: <187@dg.dg.com>
Date: 8 Jun 89 12:48:08 GMT
References: <46500067@uxe.cso.uiuc.edu> <181@dg.dg.com> <185@dg.dg.com> <17925@mimsy.UUCP>
Reply-To: uunet!dg!rec (Robert Cousins)
Organization: Data General, Westboro, MA.
Lines: 146

In article <17925@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <185@dg.dg.com> I write:
>>Scenario two:  Small dedicated buffer.
>
>>The buffer is 4K bytes long so the processor is no longer required
>>to make as timely response as above.  The real issue is now the 
>>copy time, of which there is two components:  transfer time and
>>context overhead.  The transfer time will be limited by the memory/
>>cache/CPU bottleneck.  Since the buffer is not cacheable (by implication),
>>half of the transfer will involve a bus cycle in all cases.  
>> 	[ code fragment using byte loads and stores excerpted ]
>>	Total clocks required: 9*4096=36864 per block
>>	* 50 blocks/ second = 1843200 clocks
>
>This is not an unreasonable approach to analysing the time required for
>the copies, but the code itself *is* unreasonable---it is more likely to
>be something like
>	[ code excerpted -- uses 4 byte loads and stores ]
>
>which is four times faster than your version.  

Actually, you have created a scenario 2.5.  I was assuming that cost
was the driving factor here, which rules out real dual-ported RAMs and
32-bit-wide data paths.  The increase in peripheral complexity is
substantial (there aren't many 32-bit peripherals yet, but there will
be soon! :-)), as is the cost of the RAM.

However, this scenario should be treated as reasonably as the rest.  
The equation of reference is:

	CPU Cost + IO Scheme cost
	------------------------- = $/deliverable compute unit
	CPU speed - IO overhead

I use percentages simply to avoid arguments about what reasonable 
units are.  For your suggestion to be true, the following inequality 
must hold:

	CPU Cost + 32-bit Buffer Cost	   CPU Cost + DMA Cost
	----------------------------- < ------------------------
	CPU Speed - Buffer Overhead     CPU Speed - DMA Overhead

Which is approximately equal to (when converting speed to percent):

	CPU Cost + 32-bit Buffer Cost	   CPU Cost + DMA Cost
	----------------------------- < ------------------------
		 95.5%     		  	~100%

or
	1 * (CPU Cost + 32-bit Buffer Cost) < .955 * (CPU Cost + DMA Cost)
or
	.045 * CPU Cost + 32-bit Buffer Cost < .955 * DMA Cost

which is clearly dominated by the CPU cost.  If the CPU costs a mere
$100, the CPU term is .045 * $100 = $4.50, and DMA wins as long as it
costs less than about $5 more than the buffer.
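
As a rough check of that arithmetic, the break-even point can be
computed directly.  This is only a sketch: the function name and the
sample buffer costs are made up for illustration, while the 95.5% and
~100% speed figures and the $100 CPU are the ones used above.

```python
def dma_break_even(cpu_cost, buffer_cost, buffer_speed=0.955, dma_speed=1.0):
    """Largest DMA cost at which DMA still ties the 32-bit buffer.

    Solves (cpu_cost + buffer_cost) / buffer_speed
        == (cpu_cost + dma_cost) / dma_speed      for dma_cost.
    """
    return dma_speed * (cpu_cost + buffer_cost) / buffer_speed - cpu_cost


# Hypothetical buffer costs, paired with the $100 CPU from the text.
for buffer_cost in (0.0, 5.0, 10.0):
    limit = dma_break_even(100.0, buffer_cost)
    print(f"buffer ${buffer_cost:5.2f}: DMA wins below ${limit:6.2f} "
          f"(premium ${limit - buffer_cost:4.2f})")
```

For these sample buffer costs the allowable DMA premium stays close to
$5, and it grows slowly as the buffer gets more expensive.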

>Still, 50 blocks/second is
>much too slow, especially if the blocks are only 4 KB; modern cheap SCSI
>disks deliver between 600 KB/s and 1 MB/s.  With 8 KB blocks, we should
>expect to see between 75 and 125 blocks per second.

The purpose of the 50 blocks assumption was to estimate average CPU demand
for support of I/O, not for peak situations.  Relatively few machines of
the low-end class will be used at 1 MB/s continuously.
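
For anyone who wants to vary the block rate, the average-demand
estimate can be reproduced in a few lines.  This is a sketch under
assumptions: the 20.48 MHz clock is not stated anywhere in this thread
excerpt (it is simply the rate at which 1843200 clocks/second works
out to the 9% figure used in this thread), the word-wide loop is
assumed to cost the same 9 clocks per transfer as the byte-wide one,
and the function name is illustrative.

```python
def copy_overhead(clocks_per_transfer, bytes_per_transfer, block_bytes,
                  blocks_per_second, cpu_hz):
    """Fraction of CPU clocks spent copying blocks out of the buffer."""
    transfers_per_block = block_bytes / bytes_per_transfer
    clocks_per_second = (clocks_per_transfer * transfers_per_block
                         * blocks_per_second)
    return clocks_per_second / cpu_hz


CPU_HZ = 20.48e6  # assumption: chosen so the byte-wide case lands on 9%

# Byte-wide copy: 9 clocks per byte, 4 KB blocks, 50 blocks/second.
byte_wide = copy_overhead(9, 1, 4096, 50, CPU_HZ)

# Word-wide copy (4 bytes per transfer) moving twice as much data:
# 8 KB blocks at the same 50 blocks/second.
word_wide = copy_overhead(9, 4, 8192, 50, CPU_HZ)

print(f"byte-wide: {byte_wide:.1%}, "
      f"word-wide at double the data: {word_wide:.1%}")
```

Under those assumptions the two cases come out at 9% and 4.5% of the
CPU, matching the figures being argued over.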

>So we might change your 9% estimate to 4.5% (copy four times as fast,
>but twice as often).  Nevertheless:

>>Scenario three:  Stupid DMA.

>>Here, the CPU just sets up the DMA and awaits completion.  The overhead
>>is approximately 0 compared to the above examples.  

>The overhead here is not zero.  It has been hidden.  The overhead lies in
>the fact that dual ported main memory is expensive, so either the DMA
>steals cycles that might be used by the CPU (and it can easily take about
>half the cycles needed to do the copy in the Scenario two), or the main
>memory costs more and/or is slower.

Almost any product we are talking about will have a cache (or two) with
a reasonable hit rate, which will allow DMA activity to take place with
little or no performance impact.  In fact, the major reason for speeding
up RAM is to improve processor performance on cache line loads, not to
improve peripheral performance.  

Anyway, few buses in machines of this class have usable memory
bandwidths below 25 megabytes/second, sustainable indefinitely.  If
the CPU is hogging 90% of this, there are still 2.5 megabytes per second
available for I/O.  That is enough for a continuously active Ethernet
(1.25 MB/s) along with healthy disk bandwidth (1.25 MB/s).  Since both
of these are bursty, in reality more bandwidth than that is available
at any given instant.
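
The bandwidth budget is simple enough to write down.  A minimal
sketch, using only the figures from the paragraph above plus the
10 Mbit/s Ethernet rate; the variable names are mine:

```python
BUS_MB_PER_S = 25.0          # sustainable memory bandwidth, from the text
CPU_SHARE = 0.90             # fraction of the bus the CPU is assumed to hog
ETHERNET_MBIT_PER_S = 10.0   # 10 Mbit/s Ethernet

io_budget = BUS_MB_PER_S * (1.0 - CPU_SHARE)   # what is left for I/O
ethernet = ETHERNET_MBIT_PER_S / 8.0           # bits to bytes
disk = io_budget - ethernet                    # remainder goes to disk

print(f"I/O budget {io_budget:.2f} MB/s = "
      f"Ethernet {ethernet:.2f} MB/s + disk {disk:.2f} MB/s")
```

Changing CPU_SHARE shows how quickly the I/O budget grows when the
cache keeps the CPU off the bus.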

In an earlier life, designing a 64-processor 80386 machine (there is a 
working prototype somewhere, but the company is no more :-(), I hit upon
the idea of predicting when a CPU will need bus cycles and using the
cycles predicted not to be needed for I/O.  On an 80386, it is possible
to predict bus cycle requirements with 100% accuracy using a small
amount of logic, by cheating.  My calculations showed that a 16 MHz
80386 would leave almost 10 megabytes per second of bandwidth unused,
which this method could tap for non-time-critical I/O operations such
as SCSI.  Time-critical peripherals would have to take CPU cycles if
"free" cycles were not available within their time frame, which would
not be very often.

>>Where does the DMA pay off given that all three examples have approximately
>>identical throughput? ...
>>DMA is preferable to the second choice whenever the cost of DMA is
>>less than 9% of the cost of the CPU or less than the cost of speeding
>>up the CPU by 9%.

>You have converted `% of available cycles' to `% of cost' (in the first
>half of the latter statement) and assumed a continuous range of price/
>performance in both halves, neither of which is true.

Actually, the true measure of a machine is the amount of work that it 
can do for an end user divided by the cost.  The user must define the
measure of work.  Since I'm not able to define what the user will use to
measure the machine, I must substitute a rough approximation -- deliverable
CPU power in the form of MIPS, Dhrystones, or whatever.  This value is
directly tailorable by a number of factors in the system.  Slowing down
RAM can drop cost and performance.  Sometimes it improves the ratio, 
sometimes it doesn't.  While there is not a "continuous" or even
"twice differentiable" curve here, there are so many points on it that
for the purposes of this discussion it can be assumed to be a line.
For each price point, there is an associated performance level.  Obviously,
plotting each price point vs each performance point does not yield a line,
but a cloud of points.  However, these points are easily reducible into
a family of general lines based upon CPU clock speed, DRAM speed, 
peripherals and data path size among others.

>(I happen to like DMA myself, actually.  But it does take more parts,
>and those do cost....)
>-- 
>In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
>Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

I happen to like low cost myself and have been surprised when certain
solutions turned out to be cheaper than others in counterintuitive ways.

Robert Cousins
Dept. Mgr, Workstation Dev't
Data General Corp.

Speaking for myself alone.