Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!uunet!microsoft!brianw
From: brianw@microsoft.UUCP (Brian WILLOUGHBY)
Newsgroups: comp.sys.apple
Subject: Re: Really small question (warning: long tutorial)
Summary: The good news and the bad news
Message-ID: <10100@microsoft.UUCP>
Date: 30 Dec 89 03:29:47 GMT
References: <kZTLjCG00WB7Q=4bJa@andrew.cmu.edu> <9542@microsoft.UUCP> <742@batman.moravian.EDU> <10071@microsoft.UUCP> <14552@orstcs.CS.ORST.EDU>
Reply-To: brianw@microsoft.UUCP (Brian WILLOUGHBY)
Organization: Microsoft Corp., Redmond WA
Lines: 194

throoph@jacobs.CS.ORST.EDU.UUCP (Henry Throop) writes:
>
>brianw@microsoft.UUCP (Brian WILLOUGHBY) writes:
><nicholaA@batman.moravian.EDU (Andy Nicholas) writes:
><>I thought the cycle times on MVN/MVP were 7 cycles per byte moved.  How
><>is that as fast as DMA which is supposed to be (at least what I've always
><>been told) 1 cycle per byte moved?
><
><Have you compared the speeds in an actual coding situation?
>
>No, it's 7 cycles per byte moved.  I timed an MVN moving one bank (64K) at
>452100 +/- 50 ms, which come out to (at 1.023 Mhz on my gs) 7.06 cycles
>per instruction.  Considering that there was probably a bit of overhead
>at the start or end, and maybe a few interrupts, it looks like 7 to me.

On the Apple, the *average* clock speed is 1.020484 MHz when you consider
that at the end of each video line the final clock cycle is shortened by one
period of the 14.3818 MHz clock.  In order to keep the video data in sync with
the phase of the colorburst signal, the Apple can't use a constant frequency
clock.  Pick up the SAMs manual called "The Apple II Circuit Description" for
more details.

>>Brian Willoughby
>
>Henry Throop

I should have known that I would have to eat my words if I posted before
checking the docs.  Straight from WDC: 7 cycles per byte.  That's the bad news.

The good news is that MVN/MVP is still the fastest *generic* move, where you
have total freedom over length and source and destination addresses.  Any
method of moving data faster than MV* is necessarily limited by either source
address, destination address, or BOTH.  Too bad that WDC hasn't designed a
standard 6502 bus DMA controller chip, yet.

After grabbing the documentation, I looked at how many different ways I could
move the 8192 bytes that make up the hires screen.  What follows is a summary
(hopefully not too boring) of several different approaches to moving data with
the 65C8xx :

move 8192 bytes:
Method                  #cycles  #bytes of code
Stack & Direct Page      65536   19
MVN                      57344   12
Partially Expanded Loop  57047   28
  variation of above     45143   648
Expanded (no loop)       40960   24576


EXPANDED (NO LOOP)

Subash Shankar pointed out that many graphics moves are not looped - they have
a separate instruction for each word moved.  Using 4096 LDA/STA pairs, the
65C8xx can moves 8192 bytes in only 40960 cycles, but this code occupies 24576
bytes of memory!  I think that this is the absolute fastest way to move that
many unknown (i.e. non-constant) bytes, without DMA.


STACK AND DIRECT PAGE MOVES

Looking at the number of cycles needed for each addressing mode would help:

LDA         STA   PHA 3
6   (d,x)    6    PLA 4
5 * (d),y    6    PEA 5      a
5   (d)      5    PER 6      pc+a
4   d,s      4    PEI 6 or 7 (d)
7   (d,s),y  7
3   d        3
4   d,x      4
6   [d]      6
6   [d],y    6
2   #        -
4 * a,y      5
4   a        4
4 * a,x      5
5   al       5
5   al,x     5

All instructions take an extra cycle to move a word instead of a single byte.
The addressing moves with an asterisk * after the LDA timing take an extra
cycle if you are using 16 bit indexing.  Since you can only move 256 bytes
with 8 bit indexing, this extra cycle will have to be considered.  For example:
LDA a,X takes 6 cycles when reading a word from memory with X set to 16 bits.

The quickest mode is d, or direct page, but you can't move more than 256 bytes
to the direct page.  The modes d,x and d,s are the next fastest, and give a
hint that the stack might be useful in faster moves.  Looking at PHA and PLA
show just 3 or 4 cycles respectively.

Someone mentioned a graphics hacker who used PEI.  It looks like the fastest
operation would be PEA, which only takes 5 cycles to place a CONSTANT word on
the stack.  If you were plotting a static shape to the screen, and you first
set S to the highest address, then the quickest way to change sequential bytes
would be PEI.  But for each horizontal line, you would need to update the stack
pointer (unless the shape occupied the full width of the screen).

Curiously, using 16 bit index registers does NOT add an extra cycle when
LoaDing the Accumulator from direct page indexed memory.  This actually allows
accesses outside the 256 byte direct page to be faster than the normal absolute
addressing modes.

The fastest short loop I could design using the above knowledge was as follows:

     lda #Length ;use $1FFE for hires
     tax
     clc
     adc #Dest   ;use $2000
     tas
     lda #Source
     tad
Loop lda $00,x   ;5 cycles
     pha         ;4
     dex         ;2
     dex         ;2
     bpl Loop    ;3  this limits Length to a maximum of $8000

This code uses 65536 cycles to move 8192 bytes.  It is too slow because of the
7 cycles of loop overhead to decrement x and loop back again.


PARTIALLY EXPANDED LOOP

I figured that there had to be a compromise between the fast non-looped move
which used 25K of memory and the short loop which took longer than MVN.  How
about a very long loop?  This would make the time for loop overhead have a
smaller effect in comparison to the time for actually moving the data.  To
avoid hard-coding BOTH the source AND destination addresses, the direct page
indexed mode could be used for exactly one address, and the Direct Page
Register could be changed to point to the right part of memory just before the
move.  Using 16 bit index registers, I figured the longest stretch of code
would be 256 bytes before the direct page was exhausted and the index register
would need to be changed to access more memory.

Source can be anywhere, but in this example Destination is hard-coded.
You could easily use this same algorithm with a fixed Source and variable Dest
by rewriting it.  N refers to the number of LDA/STA pairs that are repeated
before the loop restarts:

     ldx #Length   ;$1FFE - N*2
     lda #Source
     tad
; Dest is hard-coded as an absolute address
Loop lda $00,x     ;5 cycles
     sta $2000,x   ;6
     lda $02,x
     sta $2002,x
  ...              ;repeat LDA/STA pair for a total of N times
     lda $00+N-2,x
     sta $2000+N-2,x

     txa           ;2 cycles
     sec           ;2
     sbc #LoopSize ;3  LoopSize = N*2
     tax           ;2
     bpl Loop      ;3  this limits Length to $8000 or less

The only choice left is to find N, the number of times to repeat the LDA/STA
pairs to gain cycle time efficiency.  The limiting maximum would be N = 128,
because the direct page is only 256 bytes long, and we are moving word data.

There is a simple formula for the resulting code size and cycle time.

Counting the number of bytes per opcode: Size of code = 5N + 8

Number of cycles = (11N + 12)I
where I = Number of iterations of loop, I'll use 8192 bytes again
I = 4096 words/N
cycles = (11N + 12)*4096/N = 44759 + 49152/N

Using the maximum N = 128,  cycles = 45143, size = 648 bytes

The minimum code size, using MVN as a limit for cycles, can be found as
follows:

cycles = 44759 + 49152/N < MVN = 57344
therefore N > 3.9056, N has to be a whole number, so any value 4 or greater
yeilds a loop that it faster than MVN.
Using N = 4,  cycles = 57047, size = 28 bytes

What does this mean?  I've just proven to myself that you can write a rather
limited move loop that is faster than MVN, and only takes slightly more than
twice the code.  But it is not nearly as flexible.  In other words, you
couldn't use this algorithm in a Memory Manager subroutine of the Operating
System.


I hope that a few a these coding algorithms prove useful to someone else.

Brian Willoughby
UUCP:           ...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet:       microsoft!brianw@uunet.UU.NET
  or:           microsoft!brianw@Sun.COM
Bitnet          brianw@microsoft.UUCP