Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!purdue!haven!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.arch
Subject: Re: DMA on RISC-based systems
Message-ID: <17925@mimsy.UUCP>
Date: 7 Jun 89 01:37:12 GMT
References: <46500067@uxe.cso.uiuc.edu> <181@dg.dg.com> <185@dg.dg.com>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 73

In article <185@dg.dg.com> rec@dg.dg.com (Robert Cousins) writes:
>Scenario two:  Small dedicated buffer.
>The buffer is 4K bytes long so the processor is no longer required
>to make as timely response as above.  The real issue is now the
>copy time, of which there is two components:  transfer time and
>context overhead.  The transfer time will be limited by the memory/
>cache/CPU bottleneck.  Since the buffer is not cacheable (by implication),
>half of the transfer will involve a bus cycle in all cases.  Given a minor
>penalty of 4 or 5 instruction periods for this half and assuming a cache
>hit on the other side always, the code will look something like this:
>
>        ld      r1,4096
>        ld      r2,bufferaddress
>        ld      r3,destaddress
> loop:  ldb     r4,(r2)         / byte load = 4 clocks for miss
>        ldb     (r3),r4         / byte store = 1 for cache hit
>        addi    r2,1            / 1
>        addi    r3,1            / 1
>        addi    r1,-1           / 1
>        brz     loop            / 1 (code could be reorg'd)
>
>        Total clocks required:  9*4096=36864 per block
>                * 50 blocks/second = 1843200 clocks

This is not an unreasonable approach to analysing the time required
for the copies, but the code itself *is* unreasonable---it is more
likely to be something like

        ld      r1,4096/4
        lea     r2,dual_port_mem_addr
        lea     r3,dest_addr
loop:   ld      r4,(r2)         / 4-byte load
        ...
        ld      (r3),r4         / 4-byte store
        addi    r2,4
        addi    r3,4
        addi    r1,-1
        brz     loop

which is four times faster than your version.  Still, 50 blocks/second
is much too slow, especially if the blocks are only 4 KB; modern cheap
SCSI disks deliver between 600 KB/s and 1 MB/s.  With 8 KB blocks, we
should expect to see between 75 and 125 blocks per second.
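
For anyone who wants to check the arithmetic above, here is a rough
sketch in Python.  The function and variable names are made up for
illustration.  It assumes the 4-byte loop costs the same 9 clocks per
iteration as the byte loop (which is what "four times faster" implies)
and that 1 MB/s is taken as 1000 KB/s; neither assumption is stated
outright in the articles.

```python
# Per-iteration clock counts taken from the quoted listing:
# load miss (4) + store hit (1) + three addi (1 each) + branch (1).
# ASSUMPTION: the 4-byte loop costs the same 9 clocks per iteration.
CLOCKS_PER_ITERATION = 4 + 1 + 1 + 1 + 1 + 1


def clocks_per_block(block_bytes, bytes_per_iteration):
    """Clocks needed to copy one block with the given transfer width."""
    iterations = block_bytes // bytes_per_iteration
    return CLOCKS_PER_ITERATION * iterations


def clocks_per_second(block_bytes, bytes_per_iteration, blocks_per_sec):
    """Clocks spent copying per second at a given block rate."""
    return clocks_per_block(block_bytes, bytes_per_iteration) * blocks_per_sec


def blocks_per_second(disk_kb_per_sec, block_kb):
    """Block rate a disk sustains at a given transfer rate."""
    return disk_kb_per_sec / block_kb


byte_loop = clocks_per_second(4096, 1, 50)       # quoted byte-at-a-time loop
word_loop = clocks_per_second(4096, 4, 50)       # 4-byte loop, same block rate
word_loop_2x = clocks_per_second(4096, 4, 100)   # 4-byte loop, twice as often

slow_disk = blocks_per_second(600, 8)    # 600 KB/s disk, 8 KB blocks
fast_disk = blocks_per_second(1000, 8)   # 1 MB/s taken as 1000 KB/s

print(byte_loop, word_loop, word_loop_2x)
print(slow_disk, fast_disk)
```

The totals scale with the number of bytes copied per second, so the
same loop on 8 KB blocks costs twice as many clocks per block as on
4 KB blocks.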
So we might change your 9% estimate to 4.5% (copy four times as fast,
but twice as often).  Nevertheless:

>Scenario three:  Stupid DMA.
>
>Here, the CPU just sets up the DMA and awaits completion.  The overhead
>is approximately 0 compared to the above examples.

The overhead here is not zero.  It has been hidden.  The overhead lies
in the fact that dual ported main memory is expensive, so either the
DMA steals cycles that might be used by the CPU (and it can easily take
about half the cycles needed to do the copy in Scenario two), or the
main memory costs more and/or is slower.

>Where does the DMA pay off given that all three examples have approximately
>identical throughput? ...
>DMA is preferable to the second choice whenever the cost of DMA is
>less than 9% of the cost of the CPU or less than the cost of speeding
>up the CPU by 9%.

You have converted `% of available cycles' to `% of cost' (in the first
half of the latter statement) and assumed a continuous range of
price/performance in both halves, neither of which is true.

(I happen to like DMA myself, actually.  But it does take more parts,
and those do cost....)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu     Path:   uunet!mimsy!chris