Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!elroy.jpl.nasa.gov!sdd.hp.com!mips!ptimtc!nntp-server.caltech.edu!toddpw
From: toddpw@nntp-server.caltech.edu (Todd P. Whitesel)
Newsgroups: comp.sys.apple2
Subject: Re: ML subroutines (passing parameters in ML)
Message-ID: <1991Apr26.221735.4063@nntp-server.caltech.edu>
Date: 26 Apr 91 22:17:35 GMT
References: <3397@kluge.fiu.edu> <13845@ucrmath.ucr.edu> <112122@tut.cis.ohio-state.edu> <13954@ucrmath.ucr.edu>
Organization: California Institute of Technology, Pasadena
Lines: 89

rhyde@dimaggio.ucr.edu (randy hyde) writes:

>True, it's easy, but you give up two important things about the direct page--
>the ability to use it as a set of 256 scratch pad registers, and furthermore,
>it's doubtful the dp would be page aligned costing you another cycle on each
>dp memory access.

Randy, now I'm convinced you're thinking like a 6502 assembly programmer.

It is far more advantageous to have the DP move around so you can access the
automatic and argument variables with every addressing mode and instruction
available to the registers. If the dp were fixed, how would you handle local
variables? Save and restore the dp to a software stack?

You are NOT going to convince me to use stack relative addressing to get at
the arguments -- you can only use XX,s and (XX,s),y with the basic Accumulator
operations (LDA/STA/CMP/AND/ORA/EOR/ADC/SBC). The code size involved in the
shuffling of stuff between the dp and stack would be a lot worse than the way
it is now, and I think you will find if you try it (like I have) that the
execution time of 16 bit code whose data structures are not entirely in the
direct page (my main examples, an LZW decompressor and DiskCopy checksum
calculator, both of which I worked on within the last few months) easily
dominates the cycles lost to a non-aligned direct page. You literally will not
notice the difference in execution time -- unless the code is VERY intensive
with direct page variables.
You are better off not bothering to align the dp unless the code really needs
an extra 10 or 15 percent when written in assembly!! I am not denying that it
ever happens (e.g. animation code, high performance math, etc.) but for
general purpose programming it does not matter that much.

>I've often wondered why compilers insist on using the dp as a frame pointer
>rather than fixing dp (or pseudo-fixing it) and using register allocation
>schemes like a RISC machine. The resulting code would run quite a bit faster.

Not on a 65816 it wouldn't. There aren't enough registers and you can't get at
the stack as powerfully as you can the dp. This is the single biggest
deficiency of the 65816 when it comes to HLL's, IMHO.

I do agree with better register allocation, though -- Orca's use of the data
bank as a 'globals' bank pointer is idiotic and limits the size of globals to
64K. Absolute long should be used for global variables, freeing up the DBR to
act in concert with either X or Y as a scratch pointer that can random access
a 64k area located anywhere. Put a tad of lvalue caching in the code generator
and the compiler should be able to generate excellent code for ptr->struct
type situations (which happen a lot when you are using GS/OS and the toolbox):

	p->h = 5;
	p->v = 47;
	p->boingptr = NULL;

	pei p+1
	plb
	plb
	ldx p
	lda #5
	sta |h,x
		; would be pei,plb,plb,ldx but the code generator
		; notices DBR/X hasn't changed
	lda #47
	sta |v,x
		; DBR/X also hasn't changed
	stz |boingptr,x
	stz |boingptr+2,x	; can't stz [],y , can we, Mike?

The above code is about as good as a decent assembly programmer could do given
that p is an arbitrary pointer and that *p is smaller than 64K, and a good code
generator for the 65816 could do the same.
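
For anyone who wants to compile the C fragment above on a host machine, here
is a minimal self-contained sketch. The struct name Boing, the field types, and
the function name init_boing are assumptions made for illustration; the post
only names the fields h, v, and boingptr. The 16-bit types for h and v are
inferred from the single lda/sta pair each field gets in the assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: only the field names h, v, and boingptr come from
   the post. The 16-bit width of h and v is an assumption. */
struct Boing {
    int16_t h;
    int16_t v;
    struct Boing *boingptr;
};

/* The three assignments from the post. Every store goes through the same
   pointer p. */
void init_boing(struct Boing *p)
{
    p->h = 5;
    p->v = 47;
    p->boingptr = NULL;
}
```

Because all three stores use the same p, this is the case where a code
generator with lvalue caching could set up DBR/X once and reuse it for every
field.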

Orca's use of the data bank register for faster access to globals has the same
effect as fixing the dp (which Orca doesn't do, thankfully) -- you do make the
trivial code generation examples faster, but the gain is more than offset by
the nontrivial address computations (pointers used as arrays, structs, arrays,
and combinations of the above) which result in really gross code compared to
what they could be if the code generator took better advantage of the CPU
architecture.

I believe the 65816 is ADEQUATE for HLL's but that it does not leave the
compiler many options. Fighting the available instruction set and CPU
architecture is not a good idea, and that is what I see Orca's DBR use doing;
Randy's fixed dp idea would make a lot of sense if functions were generally
large and not called often, but I still think the benefits from an aligned DP
are fairly insignificant for HLL programming. I prefer a system that handles
arbitrarily complex conditions as well as the architecture allows, since the
whole reason I'm using C for GS-specific programming is to avoid dealing with
complex objects (like parameter block structs and arrays of various objects)
in assembly!

The routines that get moved to assembly are the time-critical ones that can be
coded in tight assembly, and they use a fairly simple template I developed for
emulating Orca/C's function stack frame model (except mine preserves the
caller's DBR automatically). I modified the template to align the DP and
literally did not notice the difference in execution speed (I did clock the
LZW decompressor: it was approximately 2% faster.)

Todd Whitesel
toddpw @ tybalt.caltech.edu