Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!cs.utexas.edu!uunet!dg!rec
From: rec@dg.dg.com (Robert Cousins)
Newsgroups: comp.arch
Subject: Re: DMA on RISC-based systems
Message-ID: <185@dg.dg.com>
Date: 5 Jun 89 13:31:28 GMT
References: <46500067@uxe.cso.uiuc.edu> <181@dg.dg.com> <1989May31.163057.543@utzoo.uucp> <3480@orca.WV.TEK.COM>
Reply-To: uunet!dg!rec (Robert Cousins)
Organization: Data General, Westboro, MA.
Lines: 103

In article <3480@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner) writes:
>[]
>    "basic requirements ... which must be met to be considered
>    state-of-the art ... [include] dedicated LAN controllers to
>    handle the low levels of the LAN protocol ..."
>
>If you're going to buy into off-CPU agents to move I/O around, make
>sure that those agents will improve as fast as the CPU, or your future
>generation machines will be crippled.
>
>  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
>                        (andrew%frip.wv.tek.com@relay.cs.net)  [ARPA]

I agree; however, I fear that some people misunderstood my point
concerning "RDA of hardware support." There are a number of ways to
produce brain-damaged hardware. For example, Seeq makes an Ethernet
controller chip which requires external DMA support. If a dumb DMA
channel (or no DMA) is used, the lowest levels of software will end up
being exceptionally complex, since all of the buffer management and
scatter/gather will be in software. There is also a danger of dropping
packets on the floor, which has nasty implications for performance. :-)
If, however, some slightly more reasonable DMA is supplied (similar to
that of the LANCE or the Intel chips), the software complexity drops
substantially.

While I never intended my comments to imply INTELLIGENT control, it is
worthwhile to add it to the discussion.

At DG, our experience is that it is possible to provide DMA services at
prices below competitive non-DMA products. Does this mean that the DMA
products run faster than the non-DMA ones?
Often the peripherals are the limiting factor. However, the following
analysis may be enlightening:

Scenario one: Programmed I/O. Given that a disk channel will be
averaging 200K bytes/second in 4K byte bursts 20 milliseconds apart
using a 1 megabyte/second SCSI channel, the time required to transfer
the data will be SCSI limited (given a CPU of > ~3 MIPS). However,
since each byte takes 1 microsecond, the CPU will be dedicated to the
SCSI channel for about 4 milliseconds per burst, 50 times per second,
for a total of about 200 milliseconds each second. This costs the user
20% of the available computing power.

Scenario two: Small dedicated buffer. The buffer is 4K bytes long, so
the processor is no longer required to respond as promptly as above.
The real issue is now the copy time, of which there are two components:
transfer time and context overhead. The transfer time will be limited
by the memory/cache/CPU bottleneck. Since the buffer is not cacheable
(by implication), half of the transfer will involve a bus cycle in all
cases. Given a minor penalty of 4 or 5 instruction periods for this
half, and assuming the other side always hits the cache, the code will
look something like this:

        ld      r1,4096
        ld      r2,bufferaddress
        ld      r3,destaddress
loop:   ldb     r4,(r2)         / byte load = 4 clocks for miss
        stb     (r3),r4         / byte store = 1 for cache hit
        addi    r2,1            / 1
        addi    r3,1            / 1
        addi    r1,-1           / 1
        brnz    loop            / 1 (code could be reorg'd)

Total clocks required: 9*4096 = 36864 per block * 50 blocks/second
= 1843200 clocks per second.

Given a CPU speed of 20 MHz, this translates into 9% of the CPU time.

If the CPU is required to perform the copy during an interrupt service,
there is the danger that lower priority interrupts may be lost. If the
copy takes place in the top half of the driver, then task latency
becomes an issue. The buffer will not be drained until after the task
wakes up and completes the copy.
On some Unix implementations, the task wake up time can be long --
enough to impact total throughput.

Scenario three: Stupid DMA. Here, the CPU just sets up the DMA and
awaits completion. The overhead is approximately 0 compared to the
above examples.

Where does the DMA pay off, given that all three examples have
approximately identical throughput?

DMA is preferable to the first choice whenever the cost of DMA is less
than 20% of the cost of the CPU or less than the cost of speeding up
the CPU by 20%.

DMA is preferable to the second choice whenever the cost of DMA is less
than 9% of the cost of the CPU or less than the cost of speeding up the
CPU by 9%.

I am the first to admit that these models are simplistic, but they do
represent valid considerations and reasonable approximations to the
actual solutions.

Comments?

Robert Cousins
Dept. Mgr, Workstation Dev't
Data General Corp.

Speaking for myself alone.
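
As a cross-check on the arithmetic in the three scenarios, here is a
minimal sketch in Python. It is not part of the original analysis: the
constants are the figures quoted in the post, and the function names
are made up for illustration. Using the exact 4096-byte block rather
than the rounded 4 milliseconds puts programmed I/O at roughly 20.5%
of the CPU and the buffer copy at roughly 9.2%.

```python
# Rough CPU-cost model for the three scenarios in the post.
# Constants are the post's figures; names are illustrative assumptions.

BLOCK_BYTES = 4096         # one 4K-byte burst
BLOCKS_PER_SEC = 50        # bursts arrive 20 ms apart
CPU_HZ = 20_000_000        # 20 MHz CPU assumed in scenario two


def programmed_io_fraction(us_per_byte=1.0):
    """Scenario one: the CPU is held for every byte on the 1 MB/s channel."""
    busy_us_per_sec = BLOCK_BYTES * us_per_byte * BLOCKS_PER_SEC
    return busy_us_per_sec / 1_000_000


def buffer_copy_fraction(clocks_per_byte=9):
    """Scenario two: 9 clocks per byte copied out of the uncached buffer."""
    clocks_per_sec = clocks_per_byte * BLOCK_BYTES * BLOCKS_PER_SEC
    return clocks_per_sec / CPU_HZ


def dma_fraction():
    """Scenario three: setup overhead is treated as approximately zero."""
    return 0.0


if __name__ == "__main__":
    for name, frac in [
        ("programmed I/O", programmed_io_fraction()),
        ("buffer copy", buffer_copy_fraction()),
        ("stupid DMA", dma_fraction()),
    ]:
        print(f"{name:>15}: {frac:.1%} of the CPU")
```

The model ignores interrupt entry/exit and task wake-up latency, which
the post notes can matter in practice, so the percentages are lower
bounds on the CPU cost rather than measurements.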