Path: utzoo!mnetor!uunet!lll-winken!lll-tis!ames!ncar!noao!mcdsun!fnf
From: fnf@mcdsun.UUCP (Fred Fish)
Newsgroups: comp.arch
Subject: Re: hard data on Motorola 88000
Message-ID: <833@mcdsun.UUCP>
Date: 21 Apr 88 18:55:42 GMT
References: <9916@tekecs.TEK.COM>
Reply-To: fnf@mcdsun.UUCP (Fred Fish)
Organization: Motorola Microcomputer Division
Lines: 138

In article <9916@tekecs.TEK.COM> andrew@frip.gwd.tek.com (Andrew Klossner) writes:
>The announcement is today, so I guess it's okay to talk hard data on
>the Motorola 88000 architecture.

I hope so too, or we will both be in trouble...  :-)

Andrew presents lots of interesting information about our new baby, but
I'd like to elaborate on one point before rumors get started that all
loads and stores in a 32-bit address space require two instructions.

>           Load/store instructions can take a 16 bit offset and an
>index register, which can be scaled by a factor of 1, 2, 4, or 8.  To
>get to an arbitrary 32-bit address, you need two instructions:
>
>	or.u	r2,r0,hi16(address)	; high 16 bits of address to r2
>	ld	r2,r2,lo16(address)	; load word into r2

We recognized early in the development cycle of the C compiler and associated
tools that the 16-bit immediate values in some instructions had the
potential to get us into the same ugly mess that the 80x86 camp is in,
with multiple memory "models" directly visible to the programmer.  We
wanted to hide this as much as possible, so that those programming in
a high level language, and to some extent those programming in assembler,
could simply treat the machine as if it had a linear 32-bit address
space with no special contortions necessary for access to any particular
object, no matter how large.  To demonstrate one of the features of the
tool set that accomplishes this goal, consider the following example program:

	char array[(4 * 64 * 1024) + 1];

	main ()
	{
		array[0 * 64 * 1024] = 1;
		array[1 * 64 * 1024] = 1;
		array[2 * 64 * 1024] = 1;
		array[3 * 64 * 1024] = 1;
		array[4 * 64 * 1024] = 1;
	}

The compiler produces the following assembly code (with comments stripped
by hand for the sake of saving some space):

		global		_main
		text
	_main:
		addu		r20,r0,1
		st.b		r20,r0,_array
		st.b		r20,r0,_array+65536
		st.b		r20,r0,_array+131072
		st.b		r20,r0,_array+196608
		st.b		r20,r0,_array+262144
		jmp		r1
		data
		comm		_array,262145

Note the lack of any hi16/lo16 pseudofunctions.  The compiler just
emits the straightforward, obvious code.  The assembler does not do
any particular magic with this code either.  Any expression that does
not evaluate to a constant small enough to fit into the allocated
slot in the object code is simply passed on to the linker for evaluation,
and the 16-bit slot is left as zero in the meantime.
Below is a disassembly of the relevant section of the .o file produced by
the above assembly code:

       _main 62800001 addu        r20,r0,$0001
   $00000004 2E800000 st.b        r20,r0,$0000
   $00000008 2E800000 st.b        r20,r0,$0000
   $0000000C 2E800000 st.b        r20,r0,$0000
   $00000010 2E800000 st.b        r20,r0,$0000
   $00000014 2E800000 st.b        r20,r0,$0000
   $00000018 F400C001 jmp         r1 (_main)
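
As a rough sketch of the assembler behavior just described (not the actual
assembler source), the decision can be written in C.  The struct layouts
and the names operand, reloc, and assemble_imm16 are hypothetical, made up
for this illustration; only the rule itself (small constants are encoded,
everything else is deferred to the linker with a zero placeholder) comes
from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operand: symbol + addend, or a plain constant if symbol is NULL. */
struct operand {
    const char *symbol;
    uint32_t    addend;
};

/* Hypothetical relocation record handed to the linker. */
struct reloc {
    uint32_t    insn_offset;  /* where in the text section to patch */
    const char *symbol;
    uint32_t    addend;
};

/*
 * Fill in the 16-bit immediate slot for one instruction.
 * Returns 0 if the value was encoded directly, 1 if a relocation
 * was recorded and the slot was left as zero.
 */
static int assemble_imm16(struct operand op, uint32_t insn_offset,
                          uint32_t *imm16, struct reloc *r)
{
    assert(imm16 != NULL && r != NULL);

    if (op.symbol == NULL && op.addend <= 0xFFFFu) {
        *imm16 = op.addend;          /* constant fits: encode it now */
        return 0;
    }
    *imm16 = 0;                      /* placeholder, as in the .o listing */
    r->insn_offset = insn_offset;
    r->symbol = op.symbol;
    r->addend = op.addend;
    return 1;
}
```

In the listing above, the constant 1 in the addu is encoded directly, while
every reference to _array (with or without an added offset) ends up as a
$0000 placeholder.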


Now is where the interesting stuff starts.  The linker is allocated
registers r26-r29 to use in any way it sees fit.  By convention,
the linker is also guaranteed that no user code will ever play with
these registers.  For the example above, the linker decides that its
most efficient use of the registers, based on the final address of the
data section and some other factors, is to segment the data section
into three 64K segments, followed by an "infinite" length segment.
The first three registers, r26, r27, and r28, are set up as base pointers
to these first three segments, and the last linker register, r29, is
reserved for synthesizing 32-bit addresses into the remaining "infinite"
length segment.  In effect, r29 becomes a dynamically changing
base pointer that is reloaded on an instruction-by-instruction basis
to point to the 64K data segment containing the referenced object.
When the linker does its work, it actually patches the object code,
changing register assignments and inserting instructions as necessary,
to produce the following code, which ultimately gets executed:

       _main 62800001 addu        r20,r0,$0001              
    _main+$4 2E9A0028 st.b        r20,r26,$0028
    _main+$8 2E9B0028 st.b        r20,r27,$0028
    _main+$C 2E9C0028 st.b        r20,r28,$0028
   _main+$10 5FA00043 or.u        r29,r0,$0043
   _main+$14 2E9D0028 st.b        r20,r29,$0028
   _main+$18 5FA00044 or.u        r29,r0,$0044
   _main+$1C 2E9D0028 st.b        r20,r29,$0028

Note that the data section for this sample starts at 0x400000 (the
$0040 that or.u places in the upper half of r26).  The
$0028 offset comes from the fact that crt0.o contains $0028 worth of
data that gets linked before our test array, i.e. the address of
_array ends up being 0x400028.  With this strategy, we have the
best of both worlds.  Loads and stores to objects low in the
data space use the more efficient single instruction form, while
loads and stores to objects far into the data space use the two
instruction form, and all of this is completely transparent to the
programmer.  He did not have to decide in advance whether to use
a "small model" or "huge model" for his program.
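
The per-instruction choice the linker makes in the listing above can be
sketched in C.  This is only an illustration under assumptions, not the
real linker: the names encode_imm and patch_store are hypothetical, the
fixed three-segment layout and the 64K-aligned DATA_BASE are taken from
this one example (the or.u immediates $0040 through $0044), and the opcode
values are read off the disassembly in this article.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode fields (top 6 bits), read off the disassembly in this post. */
#define OP_ST_B 0x0Bu   /* st.b rD,rS1,imm16 */
#define OP_OR_U 0x17u   /* or.u rD,rS1,imm16 */

#define DATA_BASE 0x00400000u  /* assumed: 64K-aligned start of data section */
#define NUM_BASES 3u           /* r26, r27, r28 each cover one 64K segment  */

/* Pack opcode, destination, source and 16-bit immediate into one word. */
static uint32_t encode_imm(uint32_t op, uint32_t rd, uint32_t rs1,
                           uint32_t imm16)
{
    assert(op < 64 && rd < 32 && rs1 < 32 && imm16 <= 0xFFFFu);
    return (op << 26) | (rd << 21) | (rs1 << 16) | imm16;
}

/*
 * Hypothetical patch step for "st.b rsrc,r0,addr".
 * Writes one or two instruction words to out[] and returns the count.
 */
static int patch_store(uint32_t rsrc, uint32_t addr, uint32_t out[2])
{
    uint32_t seg = (addr - DATA_BASE) >> 16;   /* which 64K segment */
    uint32_t lo  = addr & 0xFFFFu;             /* offset within it  */

    if (addr >= DATA_BASE && seg < NUM_BASES) {
        /* A base register already points at this segment. */
        out[0] = encode_imm(OP_ST_B, rsrc, 26u + seg, lo);
        return 1;
    }
    /* Otherwise build the high half in r29, then store through it. */
    out[0] = encode_imm(OP_OR_U, 29u, 0u, addr >> 16);
    out[1] = encode_imm(OP_ST_B, rsrc, 29u, lo);
    return 2;
}
```

Calling patch_store(20, 0x00400028, out) yields the single word 2E9A0028,
and patch_store(20, 0x00430028, out) yields the pair 5FA00043, 2E9D0028,
matching the patched listing.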

This is just the tip of the iceberg; there are lots of other optimizations
that become obvious.  By examining the static and dynamic characteristics
of the program, the data section objects can be sorted to get the most
frequently used objects into low data memory.  The linker might also 
decide that certain sections of the program reference portions of
data memory more often than others, and insert the appropriate code to
change the data mapping on the fly, rather than using a static mapping.
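
To make the sorting idea concrete, here is a minimal sketch in C.  It is
purely hypothetical: the struct data_obj, the refs field (a reference
count assumed to have been gathered by some earlier pass), and the
function layout_by_frequency are all invented for illustration, and
alignment padding is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical description of one object in the data section. */
struct data_obj {
    const char *name;
    uint32_t    size;    /* bytes */
    uint32_t    refs;    /* assumed: reference count from an earlier pass */
    uint32_t    offset;  /* assigned offset from start of data section */
};

/* qsort comparator: most frequently referenced objects first. */
static int by_refs_desc(const void *a, const void *b)
{
    const struct data_obj *x = a;
    const struct data_obj *y = b;
    return (x->refs < y->refs) - (x->refs > y->refs);
}

/* Sort by reference count, then pack objects from offset 0 upward. */
static void layout_by_frequency(struct data_obj *objs, size_t n)
{
    uint32_t next = 0;

    assert(objs != NULL || n == 0);
    qsort(objs, n, sizeof objs[0], by_refs_desc);
    for (size_t i = 0; i < n; i++) {
        objs[i].offset = next;   /* alignment padding ignored here */
        next += objs[i].size;
    }
}
```

With a layout like this, the hot objects land in the segments covered by
the fixed base registers, and only the rarely touched ones pay for the
two instruction form.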

One loose end in our example needs to be tied up.  How do r26, r27, and
r28 get initialized?  The answer lies in crt0, where the linker patches
a section of code to initialize any registers it uses:

   __start     5F400040 or.u        r26,r0,$0040
   __start+$4  5B5A0000 or          r26,r26,$0000
   __start+$8  5F600041 or.u        r27,r0,$0041
   __start+$C  5B7B0000 or          r27,r27,$0000
   __start+$10 5F800042 or.u        r28,r0,$0042              
   __start+$14 5B9C0000 or          r28,r28,$0000             
   __start+$18 5FA00000 or.u        r29,r0,$0000              
   __start+$1C 5BBD0000 or          r29,r29,$0000             
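
Each or.u/or pair above loads one 32-bit constant, high half first.  A
small C sketch of how such a pair could be generated follows; the names
encode_imm and emit_load32 are hypothetical, and the opcode values are
simply read off the disassembly above rather than taken from any linker
source.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode fields (top 6 bits), read off the disassembly in this post. */
#define OP_OR   0x16u   /* or   rD,rS1,imm16 : ORs into the low half  */
#define OP_OR_U 0x17u   /* or.u rD,rS1,imm16 : ORs into the high half */

/* Pack opcode, destination, source and 16-bit immediate into one word. */
static uint32_t encode_imm(uint32_t op, uint32_t rd, uint32_t rs1,
                           uint32_t imm16)
{
    assert(op < 64 && rd < 32 && rs1 < 32 && imm16 <= 0xFFFFu);
    return (op << 26) | (rd << 21) | (rs1 << 16) | imm16;
}

/* Emit the two-instruction sequence that loads a 32-bit value into reg. */
static void emit_load32(uint32_t reg, uint32_t value, uint32_t out[2])
{
    /* r0 always reads as zero, so or.u with r0 sets just the high half. */
    out[0] = encode_imm(OP_OR_U, reg, 0u, value >> 16);
    /* Then OR the low 16 bits into the same register. */
    out[1] = encode_imm(OP_OR, reg, reg, value & 0xFFFFu);
}
```

For example, emit_load32(26, 0x00400000, out) produces 5F400040 followed
by 5B5A0000, the first two words of the __start sequence.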

I hope you have found this little example interesting.  I should note
that the general idea of having the linker synthesize necessary instruction
streams to hide the 16-bit literal constant problem was first proposed to
me by a long-time Motorolan architecture expert, Bob Greiner.

-Fred
-- 
# Fred Fish    hao!noao!mcdsun!fnf    (602) 438-3614
# Motorola Computer Division, 2900 S. Diablo Way, Tempe, Az 85282  USA