Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!elroy.jpl.nasa.gov!sdd.hp.com!mips!ptimtc!nntp-server.caltech.edu!toddpw
From: toddpw@nntp-server.caltech.edu (Todd P. Whitesel)
Newsgroups: comp.sys.apple2
Subject: Re: ML subroutines (passing parameters in ML)
Message-ID: <1991Apr26.221735.4063@nntp-server.caltech.edu>
Date: 26 Apr 91 22:17:35 GMT
References: <3397@kluge.fiu.edu> <13845@ucrmath.ucr.edu> <112122@tut.cis.ohio-state.edu> <13954@ucrmath.ucr.edu>
Organization: California Institute of Technology, Pasadena
Lines: 89

rhyde@dimaggio.ucr.edu (randy hyde) writes:

>True, it's easy, but you give up two important things about the direct page--
>the ability to use it as a set of 256 scratch pad registers, and furthermore,
>it's doubtful the dp would be page aligned costing you another cycle on each
>dp memory access.

Randy, now I'm convinced you're thinking like a 6502 assembly programmer. It
is far more advantageous to have the DP move around so you can access the
automatic and argument variables with every addressing mode and instruction
available to the registers. If the dp were fixed, how would you handle local
variables? Save and restore the dp to a software stack? You are NOT going to
convince me to use stack relative addressing to get at the arguments -- you
can only use XX,s and (XX,s),y with the basic Accumulator operations (LDA/STA/
CMP/AND/ORA/EOR/ADC/SBC). The code size involved in the shuffling of stuff
between the dp and stack would be a lot worse than the way it is now, and
I think you will find if you try it (like I have) that the execution time of
16 bit code whose data structures are not entirely in the direct page (my
main examples, an LZW decompressor and DiskCopy checksum calculator, both of
which I worked on within the last few months) easily dominates the cycles lost
to a non-aligned direct page. You literally will not notice the difference in
execution time -- unless the code is VERY intensive with direct page
variables. You are better off not bothering to align the dp unless the code
really needs an extra 10 or 15 percent when written in assembly!! I am not
denying that it ever happens (e.g. animation code, high performance math,
etc.) but for general purpose programming it does not matter that much.

>I've often wondered why compilers insist on using the dp as a frame pointer
>rather than fixing dp (or pseudo-fixing it) and using register allocation
>schemes like a RISC machine.  The resulting code would run quite a bit faster.

Not on a 65816 it wouldn't. There aren't enough registers and you can't get
at the stack as powerfully as you can the dp. This is the single biggest
deficiency of the 65816 when it comes to HLL's, IMHO.

I do agree with better register allocation, though -- Orca's use of the data
bank as a 'globals' bank pointer is idiotic and limits the size of globals
to 64K. Absolute long should be used for global variables, freeing up the
DBR to act in concert with either X or Y as a scratch pointer that can random
access a 64k area located anywhere. Put a tad of lvalue caching in the code
generator and the compiler should be able to generate excellent code for
ptr->struct type situations (which happen a lot when you are using GS/OS and
the toolbox):

	p->h = 5;	p->v = 47;	p->boingptr = NULL;

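	pei	p+1		; push bank byte and middle byte of p
	plb			; pulls the middle byte (overwritten next)
	plb			; DBR = bank byte of p
	ldx	p		; X = low 16 bits of p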
	lda	#5
	sta	|h,x
; would be pei,plb,plb,ldx but the code generator notices DBR/X hasn't changed
	lda	#47
	sta	|v,x
; DBR/X also hasn't changed
	stz	|boingptr,x
	stz	|boingptr+2,x
; can't stz [],y , can we, Mike?

The above code is about as good as a decent assembly programmer could do
given that p is an arbitrary pointer and that *p is smaller than 64K, and
a good code generator for the 65816 could do the same. Orca's use of the
data bank register for faster access to globals has the same effect as
fixing the dp (which Orca doesn't do, thankfully) -- you do make the trivial
code generation examples faster, but the gain is more than offset by the
nontrivial address computations (pointers used as arrays, structs, arrays,
and combinations of the above) which result in really gross code compared
to what they could be if the code generator took better advantage of the
CPU architecture.

I believe the 65816 is ADEQUATE for HLL's but that it does not leave the
compiler many options. Fighting the available instruction set and CPU
architecture is not a good idea, and that's what I see Orca's DBR use doing;
Randy's fixed dp idea would make a lot of sense if functions were generally
large and not called often, but I still think the benefits from an aligned
DP are fairly insignificant for HLL programming. I prefer a system that
handles arbitrarily complex conditions as well as the architecture allows,
since the whole reason I'm using C for GS-specific programming is to
avoid dealing with complex objects (like parameter block structs and arrays
of various objects) in assembly! The routines that get moved to assembly are
the time-critical ones that can be coded in tight assembly, and they use a fairly
simple template I developed for emulating Orca/C's function stack frame model
(except mine preserves the caller's DBR automatically). I modified the template
to align the DP and literally did not notice the difference in execution speed
(I did clock the LZW decompressor: it was approximately 2% faster.)

Todd Whitesel
toddpw @ tybalt.caltech.edu