Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!rutgers!cs.utexas.edu!uunet!dg!rec
From: rec@dg.dg.com (Robert Cousins)
Newsgroups: comp.arch
Subject: Re: DMA on RISC-based systems
Summary: People of good will can disagree
Message-ID: <187@dg.dg.com>
Date: 8 Jun 89 12:48:08 GMT
References: <46500067@uxe.cso.uiuc.edu> <181@dg.dg.com> <185@dg.dg.com> <17925@mimsy.UUCP>
Reply-To: uunet!dg!rec (Robert Cousins)
Organization: Data General, Westboro, MA.
Lines: 146

In article <17925@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <185@dg.dg.com> I write:
>>Scenario two: Small dedicated buffer.
>
>>The buffer is 4K bytes long so the processor is no longer required
>>to make as timely response as above. The real issue is now the
>>copy time, of which there is two components: transfer time and
>>context overhead. The transfer time will be limited by the memory/
>>cache/CPU bottleneck. Since the buffer is not cacheable (by implication),
>>half of the transfer will involve a bus cycle in all cases.
>>      [ code fragment using byte loads and stores excerpted ]
>>      Total clocks required:  9*4096=36864 per block
>>                              * 50 blocks/ second = 1843200 clocks
>
>This is not an unreasonable approach to analysing the time required for
>the copies, but the code itself *is* unreasonable---it is more likely to
>be something like
>      [ code excerpted -- uses 4 byte loads and stores ]
>
>which is four times faster than your version.

Actually, you have created a scenario 2.5. I was assuming that cost
was a driving factor here, which rules out real two-ported RAMs and
32-bit-wide data paths. The increase in peripheral complexity is
substantial (there aren't many 32-bit peripherals yet, but there will
be soon! :-)), and so is the added cost of the RAM. However, this
scenario deserves the same treatment as the rest.

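For reference, the copy arithmetic behind both versions of scenario two
can be sketched as follows. This is only an illustration: the 20 MHz
clock rate is an assumption (chosen because it makes the byte-copy case
land near the 9% figure quoted in this thread), and the 9 clocks per
4-byte transfer is likewise assumed from the "four times faster" claim,
since the actual code was excerpted.

```python
# Copy-overhead arithmetic for the "small dedicated buffer" scenario.
# ASSUMPTION: CPU_HZ is hypothetical; the thread never states a clock rate.
CPU_HZ = 20_000_000

# ASSUMPTION: one load/store pair costs 9 clocks whether it moves 1 byte
# or 4 bytes (implied by "four times faster", not shown in the thread).
CLOCKS_PER_TRANSFER = 9


def copy_overhead(bytes_per_sec, bytes_per_transfer, cpu_hz=CPU_HZ):
    """Return the fraction of CPU clocks spent copying at this data rate."""
    transfers_per_sec = bytes_per_sec / bytes_per_transfer
    return transfers_per_sec * CLOCKS_PER_TRANSFER / cpu_hz


# Original scenario two: 50 blocks/s of 4 KB, copied one byte at a time.
byte_copy = copy_overhead(50 * 4096, bytes_per_transfer=1)

# "Scenario 2.5": twice the data rate, copied four bytes at a time.
word_copy = copy_overhead(2 * 50 * 4096, bytes_per_transfer=4)

print(f"byte copy overhead: {byte_copy:.1%}")
print(f"word copy overhead: {word_copy:.1%}")
```

Under those assumptions the byte copy costs roughly 9% of the CPU and
the 4-byte copy at twice the data rate roughly half that, which is
where the 95.5% figure used in the cost comparison comes from.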
The equation of reference is:

        CPU Cost + IO Scheme Cost
        ------------------------- = $/deliverable compute unit
        CPU Speed - IO Overhead

I use percentages simply to avoid arguments about what reasonable
units are. For your suggestion to be true, the following inequality
must hold:

        CPU Cost + 32-bit Buffer Cost     CPU Cost + DMA Cost
        ----------------------------- < ------------------------
         CPU Speed - Buffer Overhead    CPU Speed - DMA Overhead

which is approximately equal to (when converting speed to percent):

        CPU Cost + 32-bit Buffer Cost     CPU Cost + DMA Cost
        ----------------------------- < ------------------------
                   95.5%                        ~100%

or

        1 * (CPU Cost + 32-bit Buffer Cost) < .955 * (CPU Cost + DMA Cost)

or

        .045 * CPU Cost + 32-bit Buffer Cost < .955 * DMA Cost

which is clearly dominated by the CPU cost. If the CPU cost is simply
$100, DMA wins if it costs less than about $5 more than the buffer.

>Still, 50 blocks/second is
>much too slow, especially if the blocks are only 4 KB; modern cheap SCSI
>disks deliver between 600 KB/s and 1 MB/s. With 8 KB blocks, we should
>expect to see between 75 and 125 blocks per second.

The purpose of the 50 blocks assumption was to estimate average CPU
demand for support of I/O, not peak demand. Relatively few machines
of the low-end class will be used at 1 MB/s continuously.

>So we might change your 9% estimate to 4.5% (copy four times as fast,
>but twice as often). Nevertheless:

>>Scenario three: Stupid DMA.

>>Here, the CPU just sets up the DMA and awaits completion. The overhead
>>is approximately 0 compared to the above examples.

>The overhead here is not zero. It has been hidden. The overhead lies in
>the fact that dual ported main memory is expensive, so either the DMA
>steals cycles that might be used by the CPU (and it can easily take about
>half the cycles needed to do the copy in the Scenario two), or the main
>memory costs more and/or is slower.

Almost any product we are talking about will have a cache (or two)
with a reasonable hit rate, which allows DMA activity to take place
with little or no performance impact. In fact, the major reason for
speeding up RAM is to improve processor performance for cache line
loads, not to improve peripheral performance.

Anyway, few busses in machines of this class have usable memory
bandwidths of less than 25 megabytes/second, sustainable indefinitely.
If the CPU is hogging 90% of this, there are still 2.5 megabytes per
second available for I/O. This adds up to a continuously active
Ethernet (1.25 MB/s) along with healthy disk bandwidth (1.25 MB/s).
Since both of these are bursty, in reality a greater amount of
bandwidth is available at any given instant.

In an earlier life, designing a 64-processor 80386 machine (there is a
working prototype somewhere but the company is no more :-(), I hit
upon the idea of predicting when a CPU will need bus cycles and
handing the cycles predicted to be idle over to I/O. On an 80386, it
is possible to predict bus cycle requirements with 100% accuracy using
a small amount of logic, by cheating. My calculations showed that a
16 MHz 80386 would leave almost 10 megabytes per second of bandwidth
unused, which this method could tap for non-time-critical I/O
operations such as SCSI. Time-critical peripherals would have to take
CPU cycles if "free" cycles were not available within their time
frame, which would not be very often.

>>Where does the DMA pay off given that all three examples have approximately
>>identical throughput? ...
>>DMA is preferable to the second choice whenever the cost of DMA is
>>less than 9% of the cost of the CPU or less than the cost of speeding
>>up the CPU by 9%.

>You have converted `% of available cycles' to `% of cost' (in the first
>half of the latter statement) and assumed a continuous range of price/
>performance in both halves, neither of which is true.

Actually, the true measure of a machine is the amount of work that it
can do for an end user divided by the cost. The user must define the
measure of work. Since I'm not able to define what the user will use
to measure the machine, I must substitute a rough approximation --
deliverable CPU power in the form of MIPS, Dhrystones, or whatever.
This value can be tuned directly through a number of factors in the
system. Slowing down RAM can drop both cost and performance. Sometimes
it improves the ratio, sometimes it doesn't. While there is not a
"continuous" or even "twice differentiable" curve here, there are so
many points on it that for the purposes of this discussion it can be
assumed to be a line.

For each price point, there is an associated performance level.
Obviously, plotting each price point vs each performance point does
not yield a line, but a cloud of points. However, these points are
easily reducible to a family of general lines based upon CPU clock
speed, DRAM speed, peripherals and data path size, among others.

>(I happen to like DMA myself, actually. But it does take more parts,
>and those do cost....)
>--
>In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
>Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris

I happen to like low cost myself and have been surprised when certain
solutions turned out to be cheaper than others in counterintuitive
ways.

Robert Cousins
Dept. Mgr, Workstation Dev't
Data General Corp.

Speaking for myself alone.
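
P.S. A minimal sketch of the break-even arithmetic from the cost
inequality earlier in this post, for anyone who wants to plug in their
own numbers. The function names and the sample buffer cost of $10 are
made up for illustration; the 4.5% buffer overhead and ~0% DMA overhead
are the figures used above.

```python
def cost_per_unit(cpu_cost, io_cost, io_overhead):
    """$ per deliverable compute unit, with CPU speed normalised to 1.0."""
    return (cpu_cost + io_cost) / (1.0 - io_overhead)


def dma_break_even(cpu_cost, buffer_cost,
                   buffer_overhead=0.045, dma_overhead=0.0):
    """Highest DMA cost at which DMA is still no worse than the buffer."""
    target = cost_per_unit(cpu_cost, buffer_cost, buffer_overhead)
    return target * (1.0 - dma_overhead) - cpu_cost


# $100 CPU, with a hypothetical $10 32-bit buffer.
cpu_cost = 100.0
buffer_cost = 10.0  # ASSUMPTION: illustrative figure, not from the thread

limit = dma_break_even(cpu_cost, buffer_cost)
premium = limit - buffer_cost

print(f"DMA wins below ${limit:.2f}, about ${premium:.2f} over the buffer")
```

With a $100 CPU the allowed premium stays in the neighborhood of $5
across a range of buffer costs, which is the "about $5 more than the
buffer" figure quoted above.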