Path: utzoo!mnetor!uunet!cbmvax!daveh
From: daveh@cbmvax.UUCP (Dave Haynie)
Newsgroups: comp.sys.amiga
Subject: Re: 68030 Questions
Message-ID: <3394@cbmvax.UUCP>
Date: 1 Mar 88 01:48:58 GMT
References: <4853@videovax.Tek.COM>
Organization: Commodore Technology, West Chester, PA
Lines: 134

in article <4853@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
> Keywords: DMA, closet
> Summary: DMA is great -- in its proper place. . .

> Hmmmm. . .  I expressed my belief that (at least in a 32-bit wide 68020
> system) "DMA is the *SLOW* way to go!"  In article <3291@cbmvax.UUCP>,
> Dave Haynie (daveh@cbmvax.UUCP) replied:

>> Summary: DMA is still *FAST*er

> In my previous article, I suggested:
> 
>>>              . . .  However, for best performance you want to put the DMA
>>> peripherals on one side of a dual-ported memory and let the CPU do the
>>> data moving.

Thus, re-creating a situation very much like the way the chip bus works.  Your
design forces memory typing (MEMF_CHIP, MEMF_LAN, MEMF_HARDDISK, etc.).

> I guess I wasn't being quite as explicit as I should have been!  ...
> Thus, the LANCE would transfer 8, 16-bit words (one SILO full) every 12.8
> microseconds, tying up the CPU bus for about 6.7 microseconds, which is
> 52% of the available CPU bus bandwidth.
> 
> The LANCE can transfer only 16 bits with each memory cycle.  

Here we go again with what I meant by intelligently designed peripherals.  If
you're on a 32 bit bus, your DMA should be 32 bits wide.  And you should use a
larger FIFO, like maybe 64-128 bytes.  If you can't do either or both of these,
then, as I showed before, you'll get better performance from a 68020 move.

> Thus, its data transfer rate, during the time it is using the bus, is:

>     (8 words) * (16 bits) / (6.7 microseconds) = 19.1 Mbits/second

No intelligence here!  Why would you take over the bus and then just 
sit there?  If you are only transferring 16 bits at a time, this should
give you half the 68020 rate, 48 Mbits/second, once arbitration has
taken place.  A big enough FIFO makes the arbitration time negligible.
Extend this to 32 bits wide and you're twice the 68020 rate.  If this can't
be done from a circuit point of view, either redesign the LAN chip to make
effective use of DMA, or admit that it's a bad design.  If there are other
reasons, like the software or the user can't handle buffering delays, then this
isn't a good application for DMA, and we can turn our attention over to problems
that are well suited to DMA, like hard disk controllers.  But don't pan DMA
because it doesn't fit an arbitrary case on an arbitrary chip.
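
As a rough check on the arithmetic in this exchange, here is a minimal Python
sketch.  The first two functions only reproduce the quoted LANCE figures (8
words of 16 bits, 6.7 microseconds on the bus, every 12.8 microseconds).  The
third models why a bigger FIFO hides arbitration: it assumes one bus cycle per
transfer and a fixed arbitration cost per burst.  The 4-cycle arbitration cost
and all of the function names are made up for illustration and do not come
from either article.

```python
def burst_rate_mbits(words, bits_per_word, bus_time_us):
    """Data rate while the device holds the bus, in Mbits/second.

    Bits per microsecond is numerically the same as Mbits per second.
    """
    return words * bits_per_word / bus_time_us


def bus_occupancy(bus_time_us, period_us):
    """Fraction of the bus time consumed by the device."""
    return bus_time_us / period_us


def burst_efficiency(fifo_bytes, width_bits, arbitration_cycles):
    """Fraction of a burst spent moving data rather than arbitrating.

    Assumes one bus cycle per transfer of width_bits and a fixed
    arbitration overhead per burst (a hypothetical, simplified model).
    """
    transfer_cycles = fifo_bytes * 8 // width_bits
    return transfer_cycles / (transfer_cycles + arbitration_cycles)


if __name__ == "__main__":
    ARB_CYCLES = 4  # hypothetical arbitration overhead, in bus cycles

    print("LANCE rate on bus: %.1f Mbits/s" % burst_rate_mbits(8, 16, 6.7))
    print("LANCE bus occupancy: %.0f%%" % (100 * bus_occupancy(6.7, 12.8)))
    for fifo_bytes, width in ((16, 16), (64, 32), (128, 32)):
        eff = burst_efficiency(fifo_bytes, width, ARB_CYCLES)
        print("%3d-byte FIFO, %2d bits wide: %.0f%% of burst is data"
              % (fifo_bytes, width, 100 * eff))
```

Under these assumptions a 16-byte FIFO on a 16 bit path spends 8 of every 12
cycles moving data, while a 128-byte FIFO on a 32 bit path spends 32 of every
36, which is the sense in which a big enough FIFO makes arbitration negligible.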

>> [Timing analysis removed]
>> 
>> So in this case, DMA comes out 136 cycles, vs. 258 if the 68020 moved it all
>> by itself.

> Now, let's come back down to earth!  

Naa, that's where IBM does their design work.

> While I will admit that the principle of the A2090 would work just fine if
> one could only do it 32 bits wide, in fact it is not (reasonably) possible
> for us to do it 32 bits wide!

> Why?  Well, we will probably ship about as many instruments in one year
> as Commodore ships Amigas in a week.  (Not bad for an instrument with a
> sticker price of $16,495!)  So, I can't afford to go out and generate a
> 32-bit wide DMA chip with a 512-byte onboard FIFO.  I have to use what I
> can buy from Motorola or Hitachi or whomever.

OK, but again, you shouldn't blast the concept of DMA just because you can't
use it in your particular situation.  We make lots of Amigas, and lots of 
custom chips.  Like the DMA chip on the A2090 card.  That's only 16 bit in
this case, but we're only dealing with a 16 bit bus.

> Remember, our system memory access is about 240 nsec (asynchronous).  The
> dual-ported RAM on the LAN card is made of 4, 32K x 8 bit static RAM chips,
> and a boatload of SSI, MSI, and PALs.  The static parts are garden-variety,
> 150 nsec parts, but the actual memory access time is about 240 nsec,
> because there is clock-driven, no-deadlock, positive arbitration logic to
> ensure that one and only one customer gets the memory at a time [it works,
> too! 8^) ].  (Signetics now has a chip that allows you to do the same thing
> with dynamic RAMs -- it even takes care of the refresh!)

Well, I make the memory cycle time of a 16.67 MHz 68020 at just under 180ns.
So you're slowing down already.  But obviously a DMA device has to follow the
same rules as the 68020.  Now we have this dual ported memory.  I certainly 
believe you can build an arbiter that'll allow access to the RAM by only one
customer at a time.  But what happens when they both want it?  It appears to
me that one of them is getting wait stated.  That's what I meant by having
very FAST memory there.  The FIFO scheme starts DMA before the FIFO is 
completely filled, so that it fills just a bit before the transfer is complete.
You get your chunk of memory DMAed at full bus speed, and you get it from the
disk as fast as it could be received.  Now with the dual port scheme, you can
start filling the shared RAM early, too, since your data isn't coming in at
full bus speeds.  But eventually you want the transfer to start.  If stuff is
still coming into that memory, your transfer is going to suffer unless the RAM
is very fast.  The Amiga's CHIP RAM, for instance, is twice the speed of the
68000 memory cycle, so once you're synced up with it, there are no wait states
in normal operation (e.g., blitter's well behaved, graphics are medium 
resolutions).  So this is a good scheme.  If I ran memory at the same speed
as the 68000 memory cycle, I'd hit wait states all the time trying to access
CHIP RAM.  What you're describing would only work well if the shared memory
has relatively little truly shared access.
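
The cycle-time figure and the wait-state argument can be sketched in a few
lines of Python.  Two assumptions here are not stated in the article: that the
68020's minimum bus cycle is 3 clocks, and a deliberately crude slot model in
which the other port always wins arbitration and takes at most one memory slot
per CPU cycle.  The function names are hypothetical.

```python
from itertools import cycle


def bus_cycle_ns(clock_mhz, clocks_per_cycle=3):
    """Bus cycle time in nanoseconds (assumes a 3-clock minimum cycle)."""
    return clocks_per_cycle * 1000.0 / clock_mhz


def cpu_cycles_needed(accesses, other_port_pattern, slots_per_cpu_cycle):
    """CPU cycles that elapse before the CPU completes `accesses` accesses.

    other_port_pattern is a non-empty list of booleans, repeated forever,
    saying whether the other port wants the shared RAM in each CPU cycle.
    The other port has priority and takes one slot when it asks; the CPU
    gets its access only if a slot is left over in that cycle.
    """
    if not other_port_pattern:
        raise ValueError("other_port_pattern must not be empty")
    if slots_per_cpu_cycle < 2 and all(other_port_pattern):
        raise ValueError("CPU would be locked out of the RAM forever")

    done = 0
    elapsed = 0
    for other_wants in cycle(other_port_pattern):
        if done == accesses:
            break
        elapsed += 1
        free_slots = slots_per_cpu_cycle - (1 if other_wants else 0)
        if free_slots >= 1:
            done += 1          # CPU access completes this cycle
        # otherwise the CPU is wait stated and retries next cycle
    return elapsed


if __name__ == "__main__":
    print("16.67 MHz cycle: %.2f ns" % bus_cycle_ns(16.67))
    busy_half = [True, False]  # other port wants RAM every second cycle
    print("RAM at CPU speed:   ", cpu_cycles_needed(100, busy_half, 1))
    print("RAM at twice speed: ", cpu_cycles_needed(100, busy_half, 2))
```

In this model, RAM that runs at the CPU's cycle rate makes 100 accesses take
200 cycles when the other port is busy half the time, while RAM at twice that
rate lets both ports through with no waits.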

> Consider something else, though.  When you read from or write to your
> hard disk, the CPU is going to have to copy the data at least once.  On
> a read from the disk, you do a getchr() (or whatever), which stimulates
> the system to go read a sector into a buffer of its own.  Then (and only
> then) it passes a byte back to you.

No.  The latest Amiga DOS software is set up to read data directly into its
final destination.  From C language or whatever, you may get double buffering
if you use character by character I/O, but if you make a direct OS call, a DMA
device can use the given buffers directly.

> Agreed that I want the DMA to be 32 bits wide.  That is just very
> difficult for those of us that cannot crank up a silicon foundry whenever
> we get the itch. . .

Oh, well, I guess some of you will always have to live like that :-).

> Note again, that in real life the processor is going to have to copy the
> data somewhere else (to the ultimate consumer) once it is DMA-ed into the
> system disk buffer.  

No it isn't.  The only time the shared memory scheme wins is if the final
destination happens to be in the area of shared memory, in MEMF_HARDDISK so
to speak.  Otherwise, you'll have to do a CPU copy to the final destination,
whereas the DMA device could have put it directly there, since it can 
address all of memory.  I guess you can always tune your system software to
take advantage of the hardware, and perhaps the other way 'round too.

> 					Steve Rice
-- 
Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"