Xref: utzoo comp.sys.amiga:17090 comp.sys.amiga.tech:160
Path: utzoo!mnetor!uunet!tektronix!tekcrl!tekfdi!videovax!stever
From: stever@videovax.Tek.COM (Steven E. Rice, P.E.)
Newsgroups: comp.sys.amiga,comp.sys.amiga.tech
Subject: Re: 68030 Questions
Message-ID: <4937@videovax.Tek.COM>
Date: 31 Mar 88 17:39:33 GMT
References: <4890@videovax.Tek.COM> <3507@cbmvax.UUCP>
Reply-To: stever@videovax.Tek.COM (Steven E. Rice, P.E.)
Organization: Tektronix Television Systems, Beaverton, Oregon
Lines: 131
Summary: We're sort of edging toward agreement. . .

In article <3507@cbmvax.UUCP>, Dave Haynie (daveh@cbmvax.UUCP) writes:

> in article <4890@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
>> 
>> Dave Haynie's (daveh@cbmvax) most recent article was number
>> <3394@cbmvax.UUCP>.  In it, he cast aspersions on the poor, struggling
>> LANCE and suggested that real systems do 32-bit DMA.  Well, maybe --
>> but if you want to use Ethernet, the LANCE is about the only way to
>> go, slow or no!
> 
> Calm down!  That's not what I said.  I said that in very high 
> bandwidth-consuming operations, such as hard disk interfacing, where the
> transfer between an I/O device and CPU addressable main memory can be sent
> in large atoms, is best served by DMA, even in a 68020 or 68030 system. I
> also said that in systems where transfers must occur in small atoms or at
> relatively slow speed (like perhaps networks or things which must be
> highly interactive), the I/O scheme to shared CPU memory was a good idea.

I think there is still some misunderstanding here.  When I mention dual-
ported memories, I am speaking of memory that is "CPU addressable main
memory"!  It just happens to also be shared (on a cycle-by-cycle basis)
with some other device, which could be an I/O device or another CPU.

The Amiga implements a form of "shared" memory -- chip memory.  The
CPU gets access to chip memory on a shared basis, arbitrated cycle
by cycle.  Another form of "shared" memory is seen on the A2620 (?)
card -- the 68020 CPU card.  The 68020 will have 2 or 4 megabytes of
32-bit wide memory to which no other device can deny it access.  Thus,
if DMA is occurring to "main" memory, the 68020 may not be blocked at
all.  Carrying
the idea one step further simply removes more limitations from the system,
giving the CPU unrestricted access to the system bus and immediate access
to any memory that is not in use during that memory cycle.
 
>> In a perfect world, 32-bit DMA with a 512-byte assembly buffer and 
>> fast-as-a-speeding-bullet burst transfers would be possible.  In real
>> life, we have to make do with what we can buy.  (Commodore can build
>> what it needs; the economics in the Television Test and Measurement
>> market are different from those in the personal computer market.)
> 
> That's true, Commodore can build what it needs for those cases.  The 16 bit 
> wide DMA driven hard disk controller on the 16 bit bus delivers around 625K
> bytes/second with the Fast FileSystem.  Fast FileSystem allows DMA from the
> hard disk directly to the target memory, no intermediate buffers used.  I
> believe that any peripheral going this fast wants DMA.  It's fully extensible
> to a 32 bit machine, though a _conservative_ 32 bit machine rates that's 
> 2.5 megabytes/second throughput (not even getting to things like burst 
> transfers, which are ideally suited to DMA transfers).  If your LAN is only
> going 2.5 megabits/sec, that's certainly overkill and extra cost.

Ethernet is 10 megabits/sec.

> Which seems to make sense even today; most Amiga hard drives are DMA driven,
> most Amiga LANs are CPU driven via shared RAM DMA.

In the case of Ethernet I/O, transmissions are packetized with quite
a bit of protocol overhead.  Thus, the data to be transmitted must be
broken into chunks no larger than the largest legitimate packet and
shipped out one packet at a time.  To do this, the CPU is going to have
to move the data anyway -- it has to configure it in a form the I/O
device can use.  In this case, the copy from what you might consider
"main" memory to "shared" memory is free.

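As a rough illustration of the packetizing step described above, here is
a minimal Python sketch.  The function name and the 1500-byte limit (the
maximum Ethernet data field) are my choices for the example, and slicing
stands in for the copy into the "shared" buffer; a real driver would
also add the protocol headers.

```python
# Sketch only: split a message into chunks no larger than the largest
# legitimate packet payload.  packetize() is a hypothetical helper.
MAX_PAYLOAD = 1500  # bytes; maximum Ethernet data field


def packetize(data, max_payload=MAX_PAYLOAD):
    """Return a list of payload chunks, each at most max_payload bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    # Each slice is a copy, which models the CPU moving the data into
    # the buffer the I/O device will transmit from.
    return [data[i:i + max_payload] for i in range(0, len(data), max_payload)]


packets = packetize(bytes(4000))
print([len(p) for p in packets])  # [1500, 1500, 1000]
```

Because the CPU has to touch every byte to build the packets anyway,
directing those copies at the shared buffer adds no extra pass.
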
Starting with the FFS rate of 625K bytes/second and doubling that for a 
32-bit bus gives 1.25 megabytes/second.  This translates to a 10
megabit/second transfer rate, which is the same as Ethernet.  Using
your figure of 2.5 megabytes per second gives 20 megabits/second
throughput.  But our CPU bus bandwidth is about 100 megabits/second
(one 32-bit transfer per main memory cycle of approximately 330 nsec
[not *access* time -- *cycle* time]).  Thus, a 2.5 megabyte/second disk
transfer would occupy only about 20% of the bus bandwidth.
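
The conversions in this paragraph can be checked with a few lines of
Python.  This is only a sketch: the helper names are mine, "625K" is
treated as 0.625 megabytes, and the bus is assumed to move one 32-bit
word per 330 ns memory cycle.

```python
BITS_PER_BYTE = 8


def mbytes_to_mbits(mbytes_per_sec):
    """Convert megabytes/second to megabits/second."""
    return mbytes_per_sec * BITS_PER_BYTE


def bus_bandwidth_mbits(bus_width_bits, cycle_time_ns):
    """Peak bus bandwidth in megabits/second for one word per cycle."""
    # bits per nanosecond is Gbit/s; multiply by 1000 for Mbit/s
    return bus_width_bits / cycle_time_ns * 1000.0


ffs_32bit = mbytes_to_mbits(2 * 0.625)   # FFS rate doubled for 32 bits
disk = mbytes_to_mbits(2.5)              # the 2.5 megabyte/second figure
bus = bus_bandwidth_mbits(32, 330)       # about 97, i.e. roughly 100

print(f"FFS on 32-bit bus: {ffs_32bit:.1f} Mbit/s")
print(f"2.5 MB/s disk:     {disk:.1f} Mbit/s")
print(f"bus bandwidth:     {bus:.1f} Mbit/s")
print(f"disk share of bus: {disk / bus:.1%}")
```

Without rounding the bus figure up to 100 megabits/second, the disk's
share comes out slightly above 20%.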

If the disk DMA is transferring into unshared main memory, the CPU will
just have to wait.  At 2.5 megabytes/second (assuming 32-bit transfers),
the disk will request one memory access every 1.6 microseconds.

One possibility is to arbitrate for the bus for each transfer.  Looking
at the timing diagrams in the Motorola 68020 manual, one finds that
there is a minimum of 1/2 clock period and a maximum of 1 clock period
from the end of clock state S5 until Bus Grant* is asserted.  There is
also a note in paragraph 5.2.7.4 which says that "all asynchronous
inputs to the MC68020 are internally synchronized in a maximum of two
cycles of the system clock."  This implies that the minimum time for the
CPU to resume processing after the bus is released is 1 clock cycle.
There is probably one additional cycle needed for the CPU to resume
driving the address and data lines.

Assuming a memory cycle time of 330 ns (which is what ours is) with
240 ns read or write access time, each 32-bit word transferred would
hold the CPU bus for one arbitration time (1/2 to 1 clock cycles, or
30 to 60 ns in a 16.7 MHz system) plus one transfer time (240 ns) plus
one bus relinquishment time (1 to 2 clock cycles, or 60 to 120 ns)
plus one driver turnon time (1 clock cycle, or 60 ns).  The minimum
time required would be 390 ns, the maximum time would be 480 ns, and
the mean time would be 435 ns.
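
For anyone who wants to vary the numbers, the sum above can be written
out as a short Python sketch.  The constant names are mine; the 60 ns
clock period is the rounded figure used above for a 16.7 MHz system.

```python
CLOCK_NS = 60    # rounded clock period at 16.7 MHz
ACCESS_NS = 240  # read or write access time

# (minimum, maximum) durations in clock cycles
ARBITRATION = (0.5, 1.0)    # end of S5 until Bus Grant* asserted
RELINQUISH = (1.0, 2.0)     # bus relinquishment (synchronizer delay)
DRIVER_TURNON = (1.0, 1.0)  # CPU resumes driving address and data


def hold_time_ns(index):
    """Bus hold time per 32-bit word; index 0 = minimum, 1 = maximum."""
    cycles = ARBITRATION[index] + RELINQUISH[index] + DRIVER_TURNON[index]
    return cycles * CLOCK_NS + ACCESS_NS


t_min = hold_time_ns(0)
t_max = hold_time_ns(1)
t_mean = (t_min + t_max) / 2
print(t_min, t_max, t_mean)  # 390.0 480.0 435.0
```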

435 ns out of 1.6 us is 27.2% of the bus bandwidth occupied.  But not
only is 27.2% of the bus bandwidth occupied, the CPU is denied the
bus 27.2% of the time!  This translates directly into throughput
reduction.
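
The percentage follows from the request interval and the mean hold
time.  A small Python sketch, with a helper name of my own and 32-bit
(4-byte) transfers assumed:

```python
MEAN_HOLD_NS = 435  # mean bus hold time per 32-bit word, from above


def request_interval_ns(rate_mbytes_per_sec, bytes_per_transfer=4):
    """Time between DMA requests at the given sustained rate."""
    # bytes / (megabytes/second) is microseconds; x1000 gives ns
    return bytes_per_transfer / rate_mbytes_per_sec * 1000.0


interval = request_interval_ns(2.5)   # one request every 1.6 us
fraction = MEAN_HOLD_NS / interval
print(f"interval {interval:.0f} ns, bus denied {fraction:.1%} of the time")
```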

Another possibility is to block the data into (e.g.) 512 byte blocks and
then arbitrate for the bus once per block.  This drops the bus bandwidth
occupation to 20% (since one arbitration is insignificant compared to the
time to transfer 512 bytes as 128 32-bit words).  But the CPU is still
denied the bus 20% of the time.
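
Here is a hedged sketch of the block-transfer case in Python.  It
assumes each word in the burst takes one full 330 ns memory cycle and
charges one mean arbitration overhead (435 ns less the 240 ns transfer)
per block; both are my modeling assumptions, not measurements.

```python
CYCLE_NS = 330                       # main memory cycle time
BYTES_PER_WORD = 4                   # 32-bit transfers
DISK_MBYTES_PER_SEC = 2.5
ARBITRATION_OVERHEAD_NS = 435 - 240  # mean non-transfer time per grab


def block_occupancy(block_bytes=512):
    """Return (bus occupancy, arbitration share of the busy time)."""
    words = block_bytes // BYTES_PER_WORD
    busy_ns = ARBITRATION_OVERHEAD_NS + words * CYCLE_NS
    # bytes / (megabytes/second) is microseconds; x1000 gives ns
    period_ns = block_bytes / DISK_MBYTES_PER_SEC * 1000.0
    return busy_ns / period_ns, ARBITRATION_OVERHEAD_NS / busy_ns


occupancy, overhead_share = block_occupancy()
print(f"bus occupied {occupancy:.1%}, arbitration is "
      f"{overhead_share:.2%} of the busy time")
```

Under these assumptions the occupancy lands near the 20% figure, and
the single arbitration accounts for well under 1% of the time the bus
is held.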

If, however, the disk data is DMAed into dual-ported memory, the transfer
can deny the CPU an access a *maximum* of 20% of the time, and then only
if the CPU is fetching all of its instructions from the shared memory!  In
actual operation, it is likely to be much less than that.  There is also
no reason the receiving process cannot use the data directly from the
dual-ported memory, although in many cases there will be at least one
copy between initial transfer and use of the data.
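
To put rough numbers on "much less than that," here is a simple Python
model.  It is only a sketch: it assumes the DMA cycles and the CPU's
accesses to the dual-ported memory are independent, so the chance of a
collision is just the product of the two fractions.  The function name
is hypothetical.

```python
def cpu_denied_fraction(dma_occupancy, cpu_shared_share):
    """Fraction of CPU memory accesses that collide with DMA.

    dma_occupancy:    fraction of cycles DMA uses the dual-ported memory
    cpu_shared_share: fraction of CPU accesses aimed at that memory
    Assumes the two are independent (a simplification).
    """
    for name, value in (("dma_occupancy", dma_occupancy),
                        ("cpu_shared_share", cpu_shared_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return dma_occupancy * cpu_shared_share


for share in (1.0, 0.5, 0.1):
    denied = cpu_denied_fraction(0.20, share)
    print(f"{share:.0%} of CPU accesses shared -> {denied:.1%} denied")
```

The 20% worst case appears only when every CPU access goes to the
shared memory; a CPU running mostly out of its own memory loses far
less.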

>> There is another thought, too -- if you have only one DMA device, you
>> could argue that it shouldn't make much difference if it DMAs into
>> system RAM or into a dual-ported buffer.  If you have more than one
>> device contending for the system bus, however, multiple dual-ported
>> buffers are a clear win.
> 
> Not unless you have multiple CPUs to read them.

Given just a single hard disk transfer as you have described it, DMA into
a dual-port buffer avoids losing 20% of the CPU's processing capability.
That seems worthwhile to me!

					Steve Rice

-----------------------------------------------------------------------------
* Every knee shall bow, and every tongue confess that Jesus Christ is Lord! *
new: stever@videovax.tv.Tek.com
old: {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever