Path: utzoo!mnetor!uunet!cbmvax!daveh
From: daveh@cbmvax.UUCP (Dave Haynie)
Newsgroups: comp.sys.amiga
Subject: Re: 68030 Questions
Message-ID: <3291@cbmvax.UUCP>
Date: 10 Feb 88 06:01:56 GMT
References: <4822@videovax.Tek.COM>
Organization: Commodore Technology, West Chester, PA
Lines: 108

in article <4822@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
Summary: DMA is still *FAST*er
> Summary: DMA is the *SLOW* way to go!

> In article <3246@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:

>> This happens all the time with things like hard disk drives.  It sure does
>> hurt the 68000's speed, but consider the alternative.  You've got to get
>> that disk data into memory somehow.  If you make the 68000 go and read it
>> from an I/O port somewhere, you're running several memory cycles per data
>> transfer.  I mean, instruction fetch, I/O fetch, instruction fetch, write to
>> RAM, instruction fetch, test and branch, something like that.  Once a DMA
>> driven controller is set up (simple, nothing like setting up the blitter),
>> you have a bus arbitration, then one word transferred by the controller per
>> memory cycle.  If you're a 68020, you may even run a little from cache after
>> the arbitration.  So this is much faster than possible without DMA.

> This is true for a 68000 or 68010, and perhaps even for a 68020 or 68030 on
> a 16-bit-wide bus.  However, for best performance you want to put the DMA
> peripherals on one side of a dual-ported memory and let the CPU do the
> data moving.  

No, what you want is intelligently designed peripherals.

> Why?  The reasons are as follows:

>   1. Most DMA peripherals are incredibly sluggish...

>      To keep up with the Ethernet, the LANCE will arbitrate for the
>      bus about every 12.8 microseconds, tying it up for 5.1 microseconds
>      minimum.  This is about 40% of the bus bandwidth.

This is why we have things like FIFOs.  Even the 68020 running with cache
enabled typically uses only around 50% of the bus bandwidth.  That's not
a bad thing, though; it's a good argument for DMA.
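
For a rough feel of the numbers, here is a small Python sketch.  It takes
the quoted LANCE timing and the ~50% CPU figure at face value; the function
name is made up for illustration, and simply adding the two demands is an
assumption that ignores arbitration overhead and the two masters wanting
the bus at the same moment.

```python
def bus_fraction(hold_us, period_us):
    """Fraction of bus time a device holds, given hold time per request period."""
    return hold_us / period_us

# Figures quoted above for the LANCE: 5.1 us held out of every 12.8 us.
lance = bus_fraction(5.1, 12.8)

# Rough figure from the text for a 68020 running with its cache enabled.
cpu = 0.50

print(f"LANCE share of bus:  {lance:.1%}")        # 39.8%
print(f"CPU + LANCE demand:  {lance + cpu:.1%}")  # 89.8%
```

Under those assumptions the two together still fit on the bus, which is
the slack a FIFO-buffered DMA device can make use of.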

>   2. On a 32-bit bus, the 68020 can move data very efficiently -- once the
>      instructions have been loaded into the cache, the only thing on the
>      bus will be (32-bit) data transfers.  Even with reasonably slow
>      memory (180-nanosecond access, 300-nanosecond cycle time), this means
>      that the 68020 can transfer data twice as fast as a LANCE running
>      on 100-nanosecond access memory.

Like I said, intelligently designed peripherals.  Let's look at a hard disk
controller with a FIFO.  The Amiga 2090 controller is such a beast.  Though
only a 16-bit device, the same principles work in 32-bit land.

So my hard disk controller is chugging away, fetching data from the relatively
slow hard disk and stuffing this in the FIFO.  It sees the FIFO filling up, 
and interrupts the 68020.  The '020 springs to action, being that the disk is
run by a high priority task that was just waiting on this interrupt.  So far,
we have to do all this whether the disk controller is DMA or shared memory.

Now let's consider the shared memory.  Say we've got 512 bytes to move.  You
jump into a block move routine, where the cache immediately gets set up with
the move code after the first loop pass.  You've got one memory cycle to read
the data from shared RAM, one memory cycle to stuff it into your destination
RAM.  At 32 bits a shot that's 128 longwords, two cycles each, so you get 256
memory cycles, plus maybe 2 extra for cache setup.

Now we go to the DMA controller, moving the same 512 bytes.  We have to set 
up the controller with the destination RAM address, that should take maybe
3 cycles.  Give it another 3 to tell the DMA controller to go ahead.  Next,
maybe a cycle to arbitrate the bus.  Now we run the DMA transfer.  But we
already have the data at hand, so all the controller has to do is stuff it
in memory.  That's 128 memory cycles.  And another to re-arbitrate.

So in this case, DMA comes out 136 cycles, vs. 258 if the 68020 moved it all
by itself.
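
The tally above can be written out as a few lines of Python.  This is only
a sketch of the back-of-the-envelope arithmetic: the function and parameter
names are invented for illustration, and the cycle counts are the "maybe"
estimates from the text, not measurements of any real controller.

```python
def cpu_copy_cycles(nbytes, bus_bytes=4, cache_setup=2):
    """Memory cycles for the CPU to copy nbytes out of shared RAM."""
    transfers = -(-nbytes // bus_bytes)   # ceiling division
    # Each transfer costs one read from shared RAM and one write to
    # destination RAM; the loop itself runs from cache after setup.
    return 2 * transfers + cache_setup

def dma_copy_cycles(nbytes, bus_bytes=4, addr_setup=3, go=3,
                    arbitrate=1, rearbitrate=1):
    """Memory cycles for a FIFO-backed DMA controller to store nbytes."""
    transfers = -(-nbytes // bus_bytes)
    # The data is already in the FIFO, so each transfer is a single write.
    return addr_setup + go + arbitrate + transfers + rearbitrate

print(cpu_copy_cycles(512))   # 258
print(dma_copy_cycles(512))   # 136
```

Changing bus_bytes or the setup estimates shows how sensitive the
comparison is to those guesses.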

> If you dual-port the LANCE memory properly (32 bits wide to the 68020,
> 16 bits wide to the LANCE), you can move the data from the dual-ported
> memory *while* the LANCE is transferring other data into it, thus
> achieving an effective doubling of the transfer rate and freeing the
> bus for other purposes the rest of the time.

I get the exact same effect with my FIFO, only through use of DMA I'm tying
up the bus much less.

But not really, unless you've got some screaming RAM in that dual port 
section.  Maybe you can use some true dual-ported SRAM, or a FIFO like
what we've got on this hard disk controller, but if you're talking DRAM,
forget it, the 68020's going to eat all the available time on anything
in the 80ns or slower range.

> So, for maximum performance, hide your peripherals behind dual-ported
> memory, and then mark those pages as "non-cacheable."

There's no question that having a peripheral device dump to shared RAM
is much better than directly banging it with the CPU, Macintosh style.  And
for very small transfers, shared RAM beats DMA too, since a DMA controller
has a fixed setup time.  But if you're transferring more than a few bytes at
a time, DMA is a win.  And unless you're dealing with something that needs
immediate response (e.g., you can't wait until you've got 64 or 512 or
whatever bytes to block transfer), DMA is still a win on a 68020 system,
if done correctly.  The 68020 at 32 bits/transfer will tie a 16-bit DMA
device at transfer rate, plus it's got less setup, so you definitely want
that DMA to be 32 bits wide.
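
To put a number on "more than a few bytes", here is a hedged Python sketch
of the break-even point.  It assumes the rough costs used earlier in this
post (2 cycles of cache setup for the CPU copy, 8 cycles of setup and
arbitration for DMA); the function names and the 4096-byte search limit are
arbitrary choices for illustration.

```python
def cpu_copy_cycles(nbytes, cpu_bus_bytes=4, cache_setup=2):
    """CPU copy: one read plus one write per bus-wide transfer."""
    transfers = -(-nbytes // cpu_bus_bytes)   # ceiling division
    return 2 * transfers + cache_setup

def dma_copy_cycles(nbytes, dma_bus_bytes=4, overhead=8):
    """DMA from a FIFO: one write per transfer, plus fixed overhead
    (3 setup + 3 start + 1 arbitrate + 1 re-arbitrate, all estimates)."""
    transfers = -(-nbytes // dma_bus_bytes)
    return transfers + overhead

def break_even(dma_bus_bytes, limit=4096):
    """Smallest size (in 4-byte steps) where DMA needs fewer cycles than
    the CPU copy, or None if it never does within the limit."""
    for n in range(4, limit + 1, 4):
        if dma_copy_cycles(n, dma_bus_bytes) < cpu_copy_cycles(n):
            return n
    return None

print(break_even(4))   # 32-bit DMA
print(break_even(2))   # 16-bit DMA
```

With these estimates a 32-bit DMA device pulls ahead at 28 bytes, while a
16-bit one only matches the 68020's transfer rate and never makes up its
extra setup, so the search returns None.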

Finally, in a decent system, you can have DMA on your backplane going at
the same time you've got CPU access going on your local bus, so the DMA
won't always kick the CPU off the bus.  Amigas aren't doing it this way,
yet.

> 					Steve Rice
-- 
Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"