Path: utzoo!mnetor!uunet!husc6!mit-eddie!uw-beaver!tektronix!tekcrl!tekfdi!videovax!stever
From: stever@videovax.Tek.COM (Steven E. Rice, P.E.)
Newsgroups: comp.sys.amiga
Subject: Re: 68030 Questions
Message-ID: <4853@videovax.Tek.COM>
Date: 19 Feb 88 23:17:34 GMT
References: <4822@videovax.Tek.COM> <3291@cbmvax.UUCP>
Reply-To: stever@videovax.Tek.COM (Steven E. Rice, P.E.)
Organization: Tektronix Television Systems, Beaverton, Oregon
Lines: 234
Keywords: DMA, closet
Summary: DMA is great -- in its proper place. . .

Hmmmm. . .  I expressed my belief that (at least in a 32-bit wide 68020
system) "DMA is the *SLOW* way to go!"  In article <3291@cbmvax.UUCP>,
Dave Haynie (daveh@cbmvax.UUCP) replied:

> Summary: DMA is still *FAST*er

Now, don't get me wrong -- I'm not suggesting that we go back to the bad
old days of "programmed data transfers" (i.e., interrupt-per-byte transfers,
with the CPU stacking and unstacking its entire context for each byte that
comes in or goes out).  Long, long ago, in a galaxy far, far away, I did
that with a 6800 (our options were limited).  Maximum data transfer rate
was about 20K bytes per second, using every CPU cycle that was available.

However, I will continue to insist that there are some things that are
not fit for genteel company, and should be relegated to an appropriate
closet.  And right at the top of my list of such things is DMA I/O!!!

In my previous article, I suggested:

>>              . . .  However, for best performance you want to put the DMA
>> peripherals on one side of a dual-ported memory and let the CPU do the
>> data moving.

Dave disagreed:

> No, what you want is intelligently designed peripherals.

(AMD may be bent out of shape at such calumnies!!)  But I would suggest
that the reasons I gave are valid:

>> Why?  The reasons are as follows:
> 
>>   1. Most DMA peripherals are incredibly sluggish...
> 
>>      To keep up with the Ethernet, the LANCE will arbitrate for the
>>      bus about every 12.8 microseconds, tying it up for 5.1 microseconds
>>      minimum.  This is about 40% of the bus bandwidth.
> 
> This is why we have things like FIFOs.  Even the 68020 running with cache 
> enabled typically uses only around 50% of the bus bandwidth.  This is not
> a bad thing, though, but a good argument for DMA.

I guess I wasn't being quite as explicit as I should have been!  First, the
LANCE contains its own FIFO (they call it a "SILO").  Second, when I was
talking about the LANCE taking up 40% of the bus bandwidth, I didn't 
relate it to the transfer efficiency.  So, let me give an example I know
well -- our system:

  -- 16.67 MHz 68020 on a 32-bit wide bus.

  -- Actual memory access time about 240 nsec (from assertion of AS' to
     the CPU responding to DSACKx' by un-asserting AS').  Full memory cycle
     time about 330 nsec (provides RAS' precharge time).  Memory is
     asynchronous to the processor.

  -- LANCE Ethernet interface behind a 128K byte dual-ported memory which
     is organized as 32K x 32 bits from the 68020's perspective and
     64K x 16 bits from the LANCE's perspective.

The LANCE (along with its companion, the SIA) is an integrated solution
to Ethernet interfacing.  The LANCE manages its own "rings" of input
and output buffers, discriminates against messages that aren't intended
for it (it recognizes when it is addressed), and performs all the
housekeeping functions associated with Ethernet packet creation and
validation.  Thus, the LANCE can receive and store a complete (maximum
length 1536 octet) Ethernet packet before it pulls the CPU's chain.

For all that it interfaces to a fast bus (Ethernet is 10 Mbits/sec data
transfer rate), the LANCE has some disadvantages.  It has a minimum 600
nsec data transfer time with 100 nsec memory.  With our memory, which
responds in about 240 nsec, the LANCE would have an 800 nsec nominal
data transfer cycle.  Thus, the LANCE would transfer eight 16-bit words
(one SILO full) every 12.8 microseconds, tying up the CPU bus for about
6.7 microseconds (eight 800 nsec cycles plus arbitration), which is 52%
of the available CPU bus bandwidth.

The LANCE can transfer only 16 bits with each memory cycle.  Thus, its
data transfer rate, during the time it is using the bus, is:

    (8 words) * (16 bits) / (6.7 microseconds) = 19.1 Mbits/second
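
As a rough check of the LANCE arithmetic above, here is a minimal Python
sketch.  The 0.3 microsecond per-burst overhead is an assumption, inferred
from the gap between 8 x 800 nsec = 6.4 microseconds and the 6.7
microseconds quoted; the variable names are illustrative only.

```python
# Rough check of the LANCE bus-occupancy figures (a sketch, not a
# datasheet-accurate model).
ETHERNET_BITS_PER_SEC = 10e6   # Ethernet data rate, from the text
SILO_WORDS = 8                 # words moved per burst, from the text
WORD_BITS = 16                 # the LANCE moves 16 bits per memory cycle
LANCE_CYCLE_SEC = 800e-9       # nominal transfer cycle with 240 nsec memory
BURST_OVERHEAD_SEC = 0.3e-6    # ASSUMPTION: arbitration overhead per burst

burst_bits = SILO_WORDS * WORD_BITS

# Time Ethernet needs to deliver one SILO full of data.
burst_interval = burst_bits / ETHERNET_BITS_PER_SEC

# Time the LANCE holds the bus to move that data into memory.
bus_time = SILO_WORDS * LANCE_CYCLE_SEC + BURST_OVERHEAD_SEC

bus_fraction = bus_time / burst_interval   # share of bus bandwidth consumed
burst_rate = burst_bits / bus_time         # data rate while holding the bus

print(f"burst interval: {burst_interval * 1e6:.1f} microseconds")
print(f"bus time      : {bus_time * 1e6:.1f} microseconds")
print(f"bus fraction  : {bus_fraction * 100:.0f}%")
print(f"burst rate    : {burst_rate / 1e6:.1f} Mbits/second")
```

Under those assumptions the sketch reproduces the 12.8 microsecond
interval, the 6.7 microsecond bus time, the 52% figure, and the 19.1
Mbits/second rate.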

On the other hand, in our system the 68020 has an effective data transfer
rate (once the cache is loaded with the instructions) of:

    (1 long word) * (32 bits) / (330 nanoseconds) =  97 Mbits/second

If you cut that in half to reflect the fact that the 68020 has to both
pick the (32-bit long word) up and store it away, it still has a data
transfer rate of about 48 Mbits/sec, which is over twice that of the LANCE.
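
The same comparison can be sketched in a few lines of Python.  This is
only an illustration of the arithmetic: it assumes every access costs one
full 330 nsec memory cycle and ignores instruction fetches (the loop is
taken to be running from cache).

```python
# 68020 copy rate versus the LANCE burst rate (a sketch under the
# assumptions stated above).
MEMORY_CYCLE_SEC = 330e-9   # full memory cycle time, from the text
LONG_WORD_BITS = 32         # one long word per memory cycle
LANCE_BURST_RATE = 19.1e6   # bits/second while the LANCE holds the bus

# One 32-bit access per memory cycle.
raw_rate = LONG_WORD_BITS / MEMORY_CYCLE_SEC

# A copy needs a read and a write per long word, so halve it.
copy_rate = raw_rate / 2

ratio = copy_rate / LANCE_BURST_RATE

print(f"raw rate  : {raw_rate / 1e6:.1f} Mbits/second")
print(f"copy rate : {copy_rate / 1e6:.1f} Mbits/second")
print(f"68020 copy is {ratio:.1f} times the LANCE burst rate")
```

The unrounded results land within a Mbit/second of the figures quoted
above, and the ratio comes out at roughly two and a half.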

>>   2. On a 32-bit bus, the 68020 can move data very efficiently -- once the
>>      instructions have been loaded into the cache, the only thing on the
>>      bus will be (32-bit) data transfers.  Even with reasonably slow
>>      memory (180-nanosecond access, 300-nanosecond cycle time), this means
>>      that the 68020 can transfer data twice as fast as a LANCE running
>>      on 100-nanosecond access memory.
> 
> Like I said, intelligently designed peripherals.  Let's look at a hard disk
> controller with FIFO.  The Amiga 2090 controller is such a beast.  Though
> only a 16 bit device, the same principals work in 32 bit land.

Most principals work in schools. . .

> So my hard disk controller is chugging away, fetching data from the
> relatively slow hard disk and stuffing this in the FIFO.  It sees the FIFO
> filling up, and interrupts the 68020.  The '020 springs to action, being
> that the disk is run by a high priority task that was just waiting on this
> interrupt.  So far we'd have to do this whether the disk controller is
> DMA or shared memory.
> 
> Now let's consider the shared memory.  Say we've got 512 bytes to move.  You
> jump into a block move routine, where the cache immediately gets set up with
> the move code after the first loop pass.  You've got one memory cycle to read
> the data from shared RAM, one memory cycle to stuff it into your destination
> RAM.  So you get 256 memory cycles, plus maybe 2 extra for cache setup.
> 
> Now we go to the DMA controller, moving the same 512 bytes.  We have to set 
> up the controller with the destination RAM address, that should take maybe
> 3 cycles.  Give it another 3 to tell the DMA controller to go ahead.  Next,
> maybe a cycle to arbitrate the bus.  Now we run the DMA transfer.  But we
> already have the data at hand, so all the controller has to do is stuff it
> in memory.  That's 128 memory cycles.  And another to re-arbitrate.
> 
> So in this case, DMA comes out 136 cycles, vs. 258 if the 68020 moved it all
> by itself.
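
Dave's tally can be reproduced with a short Python sketch.  It takes his
premises as given: a 32-bit wide DMA controller with the whole block
already sitting in its FIFO, and his "maybe" estimates for setup and
arbitration.  The names are illustrative only.

```python
# Memory-cycle count for moving one 512-byte block, following the
# quoted estimates.
BLOCK_BYTES = 512
BUS_BYTES = 4                  # 32-bit wide bus

long_words = BLOCK_BYTES // BUS_BYTES

# CPU copy through shared memory: one read plus one write per long
# word, plus 2 cycles to get the move loop into the cache.
CACHE_SETUP_CYCLES = 2
cpu_cycles = 2 * long_words + CACHE_SETUP_CYCLES

# DMA: 3 cycles to load the address, 3 to start the controller,
# 1 to arbitrate, one write per long word, 1 to re-arbitrate.
DMA_SETUP_CYCLES = 3
DMA_START_CYCLES = 3
ARBITRATE_CYCLES = 1
dma_cycles = (DMA_SETUP_CYCLES + DMA_START_CYCLES + ARBITRATE_CYCLES
              + long_words + ARBITRATE_CYCLES)

print(f"CPU copy: {cpu_cycles} cycles, DMA: {dma_cycles} cycles")
```

The counts match the 258 and 136 quoted, but only so long as the DMA
side really is 32 bits wide.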

Now, let's come back down to earth!  We (Tektronix Television Systems) have
a 68020-based professional television measurement instrument (the VM700)
that is just about ready to ship to customers.  It is 32 bits wide all over
the place, for maximum data transfer rate consistent with reasonable cost.
While I will admit that the principle of the A2090 would work just fine if
one could only do it 32 bits wide, in fact it is not (reasonably) possible
for us to do it 32 bits wide!

Why?  Well, we will probably ship about as many instruments in one year
as Commodore ships Amigas in a week.  (Not bad for an instrument with a
sticker price of $16,495!)  So, I can't afford to go out and generate a
32-bit wide DMA chip with a 512-byte onboard FIFO.  I have to use what I
can buy from Motorola or Hitachi or whomever.

Believe me, we did look at DMA chips before making the basic system
design decisions -- and the DMA chips are nearly as bad as the LANCE!
Minimum DMA cycle time I found was 500 nsec, again assuming nearly
instantaneous memory response.  And the best of them were only 16 bits
wide.
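
To put numbers on that, here is a hedged Python sketch comparing bus time
for one 512-byte block.  It assumes the off-the-shelf DMA chip runs at its
500 nsec minimum (a best case) and that the 68020 copy costs one 330 nsec
memory cycle per access; setup, arbitration, and cache fill are ignored
on both sides.

```python
# Bus time to move 512 bytes: off-the-shelf 16-bit DMA chip versus a
# 68020 copy loop (a sketch under the assumptions stated above).
BLOCK_BYTES = 512

DMA_WIDTH_BYTES = 2            # best chips found were 16 bits wide
DMA_MIN_CYCLE_SEC = 500e-9     # minimum DMA cycle, ideal memory

CPU_WIDTH_BYTES = 4            # 68020 on a 32-bit bus
CPU_MEMORY_CYCLE_SEC = 330e-9  # full memory cycle, from the text

# DMA: one 16-bit transfer per DMA cycle.
dma_transfers = BLOCK_BYTES // DMA_WIDTH_BYTES
dma_bus_time = dma_transfers * DMA_MIN_CYCLE_SEC

# CPU: one read and one write per 32-bit long word.
cpu_cycles = 2 * (BLOCK_BYTES // CPU_WIDTH_BYTES)
cpu_bus_time = cpu_cycles * CPU_MEMORY_CYCLE_SEC

print(f"DMA chip : {dma_bus_time * 1e6:.1f} microseconds")
print(f"68020    : {cpu_bus_time * 1e6:.1f} microseconds")
```

With these numbers the 16-bit DMA chip holds the bus for about 128
microseconds, while the 68020 copy finishes in about 84.5.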

>> If you dual-port the LANCE memory properly (32 bits wide to the 68020,
>> 16 bits wide to the LANCE), you can move the data from the dual-ported
>> memory *while* the LANCE is transferring other data into it, thus
>> achieving an effective doubling of the transfer rate and freeing the
>> bus for other purposes the rest of the time.
> 
> I get the exact same effect with my FIFO, only through use of DMA I'm tying
> up the bus much less.
> 
> But not really, unless you've got some screaming RAM in that dual port 
> section.  Maybe you can use some true dual-ported SRAM, or a FIFO like
> what we've got on this hard disk controller, but if you're talking DRAM,
> forget it, the 68020's going to eat all the available time on anything
> in the 80ns or slower range.

Remember, our system memory access is about 240 nsec (asynchronous).  The
dual-ported RAM on the LAN card is made of four 32K x 8 bit static RAM chips,
and a boatload of SSI, MSI, and PALs.  The static parts are garden-variety,
150 nsec parts, but the actual memory access time is about 240 nsec,
because there is clock-driven, no-deadlock, positive arbitration logic to
ensure that one and only one customer gets the memory at a time [it works,
too! 8^) ].  (Signetics now has a chip that allows you to do the same thing
with dynamic RAMs -- it even takes care of the refresh!)

Because of this, the LANCE can access memory once per 800 nsec (or so),
and the 68020 can get one or two 32-bit accesses in between each of the
LANCE's 16-bit accesses.  Remember, too, that while the LANCE has the
bus, its effective data rate is only about 19.1 Mbits/second.  The 68020,
by contrast, has to read the data from the dual-port RAM on one memory
cycle and write it to system memory on the next, yet its effective data
transfer bandwidth is still 48 Mbits/second.
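
The "one or two accesses" figure can be sketched in Python as follows.
The model is an assumption, not a description of the actual arbitration
logic: the LANCE is taken to occupy the dual-port RAM for one 240 nsec
access in every 800 nsec period, and each 68020 access is bounded between
the 240 nsec access time and the 330 nsec full memory cycle.

```python
# How many 68020 accesses fit between successive LANCE accesses to the
# dual-port RAM (a simplified timing sketch; see assumptions above).
LANCE_PERIOD_NS = 800   # LANCE accesses memory about once per 800 nsec
PORT_ACCESS_NS = 240    # arbitrated dual-port access time
CPU_CYCLE_NS = 330      # full 68020 memory cycle time

# Time in each LANCE period when the dual-port RAM is free for the CPU.
free_ns = LANCE_PERIOD_NS - PORT_ACCESS_NS

best_case = free_ns // PORT_ACCESS_NS   # back-to-back 240 nsec accesses
worst_case = free_ns // CPU_CYCLE_NS    # each access costs a full cycle

print(f"free window: {free_ns} nsec")
print(f"68020 accesses per LANCE access: {worst_case} to {best_case}")
```

Under that model the free window is 560 nsec, which allows between one
and two 32-bit accesses per LANCE access.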

Thus, even without the rest of the argument, my conclusion is still:

>> So, for maximum performance, hide your peripherals behind dual-ported
>> memory, and then mark those pages as "non-cacheable."

Consider something else, though.  When you read from or write to your
hard disk, the CPU is going to have to copy the data at least once.  On
a read from the disk, you do a getchar() (or whatever), which stimulates
the system to go read a sector into a buffer of its own.  Then (and only
then) it passes a byte back to you.

If the disk DMA transfer occurs on the system bus, the data moves over
that bus *twice* before it gets to the user.  On the other hand, if the
hard disk controller board has its own (dual-ported) memory, which is
accessible to the CPU, the DMA can transfer into dual-ported memory
without disturbing the CPU at all.  When the data is passed to the user,
it moves over the system bus only once.
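
A small Python sketch of the bus-cycle bookkeeping for one 512-byte
sector.  It assumes a 16-bit DMA chip, a 32-bit CPU copy (one read and one
write per long word), and that a CPU read from the dual-ported memory
counts as a system bus cycle; the function name is made up for
illustration.

```python
# System bus cycles consumed to get one sector from the disk to the
# user's buffer (a sketch under the assumptions stated above).
SECTOR_BYTES = 512
DMA_WIDTH_BYTES = 2   # ASSUMPTION: 16-bit DMA chip
CPU_WIDTH_BYTES = 4   # 68020 on a 32-bit bus


def system_bus_cycles(dma_on_system_bus: bool) -> int:
    """Count system bus cycles for one sector read."""
    # The CPU always copies the sector to the user: read plus write.
    copy_cycles = 2 * (SECTOR_BYTES // CPU_WIDTH_BYTES)
    # DMA cycles land on the system bus only if the DMA lives there;
    # with dual-ported memory they happen on the controller's side.
    if dma_on_system_bus:
        dma_cycles = SECTOR_BYTES // DMA_WIDTH_BYTES
    else:
        dma_cycles = 0
    return dma_cycles + copy_cycles


print("DMA on system bus :", system_bus_cycles(True), "cycles")
print("dual-ported memory:", system_bus_cycles(False), "cycles")
```

With these assumptions the dual-ported arrangement uses half as many
system bus cycles per sector.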

> There's no question that having a peripheral device dump to shared RAM
> is much better than directly banging it with the CPU, Macintosh style.  And
> for very small transfer situations, it's better.  A DMA controller has a
> fixed setup time.  But if you're transferring more than a few bytes at a
> time, DMA is a win.  And unless you're dealing with something that needs
> immediate response (eg, you can't wait until you've got 64 or 512 or 
> whatever bytes to block transfer), DMA is still a win on a 68020 system,
> if done correctly.  The 68020 at 32 bits/transfer will tie a 16 bit DMA
> device at transfer rate, plus it's got less setup, so you definitely want
> that DMA to be 32 bits wide.

Agreed that I want the DMA to be 32 bits wide.  That is just very
difficult for those of us who cannot crank up a silicon foundry whenever
we get the itch. . .

Note again, that in real life the processor is going to have to copy the
data somewhere else (to the ultimate consumer) once it is DMA-ed into the
system disk buffer.  There will be fewer transfers over the system bus
(and thus more cycles available to the CPU) if the DMA moves data from the
disk into dual-ported memory, so that it need pass over the system bus
only once.

> Finally, in a decent system, you can have DMA on your backplane going at
> the same time you've got CPU access going on your local bus, so the DMA
> won't always kick the CPU off the bus.  Amigas aren't doing it this way,
> yet.

But Amigas will, I hope, I hope, I hope. . . 8^)

(By the way, if you've followed what I was saying, that's what we have
in the VM700 -- except the DMA runs on its own private "bus," and the
CPU *always* has the system bus available to it!)

					Steve Rice

-----------------------------------------------------------------------------
* Every knee shall bow, and every tongue confess that Jesus Christ is Lord! *
new: stever@videovax.tv.Tek.com
old: {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever