Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!uunet!microsoft!brianw From: brianw@microsoft.UUCP (Brian WILLOUGHBY) Newsgroups: comp.sys.apple Subject: Re: Really small question (warning: long tutorial) Summary: The good news and the bad news Message-ID: <10100@microsoft.UUCP> Date: 30 Dec 89 03:29:47 GMT References: <9542@microsoft.UUCP> <742@batman.moravian.EDU> <10071@microsoft.UUCP> <14552@orstcs.CS.ORST.EDU> Reply-To: brianw@microsoft.UUCP (Brian WILLOUGHBY) Organization: Microsoft Corp., Redmond WA Lines: 194 throoph@jacobs.CS.ORST.EDU.UUCP (Henry Throop) writes: > >brianw@microsoft.UUCP (Brian WILLOUGHBY) writes: ><>I thought the cycle times on MVN/MVP were 7 cycles per byte moved. How ><>is that as fast as DMA which is supposed to be (at least what I've always ><>been told) 1 cycle per byte moved? >< > >No, it's 7 cycles per byte moved. I timed an MVN moving one bank (64K) at >452100 +/- 50 ms, which come out to (at 1.023 Mhz on my gs) 7.06 cycles >per instruction. Considering that there was probably a bit of overhead >at the start or end, and maybe a few interrupts, it looks like 7 to me. On the Apple, the *average* clock speed is 1.020484 MHz when you consider that at the end of each video line the final clock cycle is shortened by one period of the 14.3818 MHz clock. In order to keep the video data in sync with the phase of the colorburst signal, the Apple can't use a constant frequency clock. Pick up the SAMs manual called "The Apple II Circuit Description" for more details. >>Brian Willoughby > >Henry Throop I should have known that I would have to eat my words if I posted before checking the docs. Straight from WDC: 7 cycles per byte. That's the bad news. The good news is that MVN/MVP is still the fastest *generic* move, where you have total freedom over length and source and destination addresses. Any method of moving data faster than MV* is necessarily limited by either source address, destination address, or BOTH. Too bad that WDC hasn't designed a standard 6502 bus DMA controller chip, yet. After grabbing the documentation, I looked at how many different ways I could move the 8192 bytes that make up the hires screen. What follows is a summary (hopefully not too boring) of several different approaches to moving data with the 65C8xx : move 8192 bytes: Method #cycles #bytes of code Stack & Direct Page 65536 19 MVN 57344 12 Partially Expanded Loop 57047 28 variation of above 45143 648 Expanded (no loop) 40960 24576 EXPANDED (NO LOOP) Subash Shankar pointed out that many graphics moves are not looped - they have a separate instruction for each word moved. Using 4096 LDA/STA pairs, the 65C8xx can moves 8192 bytes in only 40960 cycles, but this code occupies 24576 bytes of memory! I think that this is the absolute fastest way to move that many unknown (i.e. non-constant) bytes, without DMA. STACK AND DIRECT PAGE MOVES Looking at the number of cycles needed for each addressing mode would help: LDA STA PHA 3 6 (d,x) 6 PLA 4 5 * (d),y 6 PEA 5 a 5 (d) 5 PER 6 pc+a 4 d,s 4 PEI 6 or 7 (d) 7 (d,s),y 7 3 d 3 4 d,x 4 6 [d] 6 6 [d],y 6 2 # - 4 * a,y 5 4 a 4 4 * a,x 5 5 al 5 5 al,x 5 All instructions take an extra cycle to move a word instead of a single byte. The addressing moves with an asterisk * after the LDA timing take an extra cycle if you are using 16 bit indexing. Since you can only move 256 bytes with 8 bit indexing, this extra cycle will have to be considered. For example: LDA a,X takes 6 cycles when reading a word from memory with X set to 16 bits. The quickest mode is d, or direct page, but you can't move more than 256 bytes to the direct page. The modes d,x and d,s are the next fastest, and give a hint that the stack might be useful in faster moves. Looking at PHA and PLA show just 3 or 4 cycles respectively. Someone mentioned a graphics hacker who used PEI. It looks like the fastest operation would be PEA, which only takes 5 cycles to place a CONSTANT word on the stack. If you were plotting a static shape to the screen, and you first set S to the highest address, then the quickest way to change sequential bytes would be PEI. But for each horizontal line, you would need to update the stack pointer (unless the shape occupied the full width of the screen). Curiously, using 16 bit index registers does NOT add an extra cycle when LoaDing the Accumulator from direct page indexed memory. This actually allows accesses outside the 256 byte direct page to be faster than the normal absolute addressing modes. The fastest short loop I could design using the above knowledge was as follows: lda #Length ;use $1FFE for hires tax clc adc #Dest ;use $2000 tas lda #Source tad Loop lda $00,x ;5 cycles pha ;4 dex ;2 dex ;2 bpl Loop ;3 this limits Length to a maximum of $8000 This code uses 65536 cycles to move 8192 bytes. It is too slow because of the 7 cycles of loop overhead to decrement x and loop back again. PARTIALLY EXPANDED LOOP I figured that there had to be a compromise between the fast non-looped move which used 25K of memory and the short loop which took longer than MVN. How about a very long loop? This would make the time for loop overhead have a smaller effect in comparison to the time for actually moving the data. To avoid hard-coding BOTH the source AND destination addresses, the direct page indexed mode could be used for exactly one address, and the Direct Page Register could be changed to point to the right part of memory just before the move. Using 16 bit index registers, I figured the longest stretch of code would be 256 bytes before the direct page was exhausted and the index register would need to be changed to access more memory. Source can be anywhere, but in this example Destination is hard-coded. You could easily use this same algorithm with a fixed Source and variable Dest by rewriting it. N refers to the number of LDA/STA pairs that are repeated before the loop restarts: ldx #Length ;$1FFE - N*2 lda #Source tad ; Dest is hard-coded as an absolute address Loop lda $00,x ;5 cycles sta $2000,x ;6 lda $02,x sta $2002,x ... ;repeat LDA/STA pair for a total of N times lda $00+N-2,x sta $2000+N-2,x txa ;2 cycles sec ;2 sbc #LoopSize ;3 LoopSize = N*2 tax ;2 bpl Loop ;3 this limits Length to $8000 or less The only choice left is to find N, the number of times to repeat the LDA/STA pairs to gain cycle time efficiency. The limiting maximum would be N = 128, because the direct page is only 256 bytes long, and we are moving word data. There is a simple formula for the resulting code size and cycle time. Counting the number of bytes per opcode: Size of code = 5N + 8 Number of cycles = (11N + 12)I where I = Number of iterations of loop, I'll use 8192 bytes again I = 4096 words/N cycles = (11N + 12)*4096/N = 44759 + 49152/N Using the maximum N = 128, cycles = 45143, size = 648 bytes The minimum code size, using MVN as a limit for cycles, can be found as follows: cycles = 44759 + 49152/N < MVN = 57344 therefore N > 3.9056, N has to be a whole number, so any value 4 or greater yeilds a loop that it faster than MVN. Using N = 4, cycles = 57047, size = 28 bytes What does this mean? I've just proven to myself that you can write a rather limited move loop that is faster than MVN, and only takes slightly more than twice the code. But it is not nearly as flexible. In other words, you couldn't use this algorithm in a Memory Manager subroutine of the Operating System. I hope that a few a these coding algorithms prove useful to someone else. Brian Willoughby UUCP: ...!{tikal, sun, uunet, elwood}!microsoft!brianw InterNet: microsoft!brianw@uunet.UU.NET or: microsoft!brianw@Sun.COM Bitnet brianw@microsoft.UUCP