Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!purdue!haven!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.arch
Subject: Re: DMA on RISC-based systems
Message-ID: <17925@mimsy.UUCP>
Date: 7 Jun 89 01:37:12 GMT
References: <46500067@uxe.cso.uiuc.edu> <181@dg.dg.com> <185@dg.dg.com>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 73

In article <185@dg.dg.com> rec@dg.dg.com (Robert Cousins) writes:
>Scenario two:  Small dedicated buffer.

>The buffer is 4K bytes long so the processor is no longer required
>to make as timely response as above.  The real issue is now the 
>copy time, of which there is two components:  transfer time and
>context overhead.  The transfer time will be limited by the memory/
>cache/CPU bottleneck.  Since the buffer is not cacheable (by implication),
>half of the transfer will involve a bus cycle in all cases.  Given a minor
>penalty of 4 or 5 instruction periods for this half and assuming a cache
>hit on the other side always, the code will look something like this:
>
>		ld	r1,4096
>		ld	r2,bufferaddress
>		ld	r3,destaddress
>	loop:	ldb	r4,(r2)		/	byte load = 4 clocks for miss
>		ldb	(r3),r4		/	byte store = 1 for cache hit
>		addi	r2,1		/	1
>		addi	r3,1		/	1
>		addi	r1,-1		/	1
>		brz	loop		/	1 (code could be reorg'd)
>
>	Total clocks required: 9*4096=36864 per block
>	* 50 blocks/ second = 1843200 clocks

This is not an unreasonable approach to analysing the time required for
the copies, but the code itself *is* unreasonable: a real copy loop would
move a word at a time rather than a byte, so it is more likely to be
something like

		ld	r1,4096/4
		lea	r2,dual_port_mem_addr
		lea	r3,dest_addr
	loop:	ld	r4,(r2)		/	4-byte load ...
		ld	(r3),r4		/	4-byte store
		addi	r2,4		/	advance source
		addi	r3,4		/	advance destination
		addi	r1,-1		/	one fewer word to copy
		brnz	loop		/	repeat until count is zero

which is four times as fast as your version.  Still, 50 blocks/second is
much too low a rate to assume, especially if the blocks are only 4 KB
(that is only 200 KB/s); modern cheap SCSI disks deliver between 600 KB/s
and 1 MB/s.  With 8 KB blocks, we should expect to see between 75 and 125
blocks per second.
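
For anyone who wants to check these figures, here is a rough sketch of
the arithmetic.  It assumes the per-instruction clock counts from the
quoted analysis carry over unchanged to the word loop (in particular,
that an uncached 4-byte load costs the same 4 clocks as a byte load),
and it reads 1 MB/s as 1000 KB/s, which is what the 125 figure implies.
The function names are made up for illustration.

```python
# Clocks per loop iteration, from the quoted analysis:
# 4 (uncached load) + 1 (cached store) + 4 single-clock instructions.
# Assumption: the word loop costs the same per iteration as the byte loop.
CLOCKS_PER_ITERATION = 4 + 1 + 1 + 1 + 1 + 1


def clocks_per_block(block_bytes, bytes_per_transfer):
    """Clocks to copy one block, moving bytes_per_transfer bytes per iteration."""
    iterations = block_bytes // bytes_per_transfer
    return iterations * CLOCKS_PER_ITERATION


def blocks_per_second(disk_kb_per_s, block_kb):
    """Blocks arriving per second at a given sustained disk transfer rate."""
    return disk_kb_per_s / block_kb


byte_loop = clocks_per_block(4096, 1)   # one byte per iteration
word_loop = clocks_per_block(4096, 4)   # four bytes per iteration
speedup = byte_loop / word_loop

slow_disk = blocks_per_second(600, 8)    # 600 KB/s, 8 KB blocks
fast_disk = blocks_per_second(1000, 8)   # 1 MB/s taken as 1000 KB/s
```

Under those assumptions the byte loop costs 36864 clocks per 4 KB block,
as in the quoted article, and the word loop a quarter of that.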

So we might change your 9% estimate to 4.5% (copy four times as fast,
but twice as often).  Nevertheless:

>Scenario three:  Stupid DMA.
>
>Here, the CPU just sets up the DMA and awaits completion.  The overhead
>is approximately 0 compared to the above examples.  

The overhead here is not zero.  It has been hidden.  The overhead lies in
the fact that dual-ported main memory is expensive, so either the DMA
steals cycles that might be used by the CPU (and it can easily take about
half the cycles needed to do the copy in Scenario two), or the main
memory costs more and/or is slower.
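
To put a rough number on the hidden cost, the sketch below takes the
factors stated above at face value: the 9% figure from the quoted
article, a copy four times as fast, done twice as often, and DMA
stealing about half the cycles the copy would need.  The function name
and the exact 0.5 factor are illustrative assumptions, not measurements;
if it is the byte rate rather than the block rate that scales, a
different rate_factor applies.

```python
def scaled_cpu_share(base_share, speedup, rate_factor):
    """Scale a CPU-share estimate for a faster copy done more often."""
    return base_share * rate_factor / speedup


# 9% is the estimate from the quoted article; 4 and 2 are the
# "four times as fast, but twice as often" factors.
copy_share = scaled_cpu_share(0.09, speedup=4, rate_factor=2)

# Assumption: cycle-stealing DMA costs about half of what the copy would.
DMA_FRACTION_OF_COPY = 0.5
dma_share = DMA_FRACTION_OF_COPY * copy_share
```

On these assumptions the programmed copy takes about 4.5% of the CPU and
cycle-stealing DMA about 2.25%: smaller, but not "approximately 0".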

>Where does the DMA pay off given that all three examples have approximately
>identical throughput? ...
>DMA is preferable to the second choice whenever the cost of DMA is
>less than 9% of the cost of the CPU or less than the cost of speeding
>up the CPU by 9%.

You have converted `% of available cycles' to `% of cost' (in the first
half of the latter statement) and assumed a continuous range of
price/performance in both halves.  Neither step is justified: cycles do
not convert directly to cost, and price/performance comes in discrete
steps.

(I happen to like DMA myself, actually.  But it does take more parts,
and those do cost....)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris