Path: utzoo!mnetor!uunet!lll-winken!lll-tis!ames!ncar!noao!mcdsun!fnf
From: fnf@mcdsun.UUCP (Fred Fish)
Newsgroups: comp.arch
Subject: Re: hard data on Motorola 88000
Message-ID: <833@mcdsun.UUCP>
Date: 21 Apr 88 18:55:42 GMT
References: <9916@tekecs.TEK.COM>
Reply-To: fnf@mcdsun.UUCP (Fred Fish)
Organization: Motorola Microcomputer Division
Lines: 138

In article <9916@tekecs.TEK.COM> andrew@frip.gwd.tek.com (Andrew Klossner) writes:
>The announcement is today, so I guess it's okay to talk hard data on
>the Motorola 88000 architecture.

I hope so too, or we will both be in trouble...  :-)

Andrew presents lots of interesting information about our new baby, but
I'd like to elaborate on one point before rumors get started that all
loads and stores in a 32-bit address space require two instructions.

>	Load/store instructions can take a 16 bit offset and an
>index register, which can be scaled by a factor of 1, 2, 4, or 8.  To
>get to an arbitrary 32-bit address, you need two instructions:
>
>	or.u	r2,r0,hi16(address)	; high 16 bits of address to r2
>	ld	r2,r2,lo16(address)	; load word into r2

We recognized early in the development cycle of the C compiler and
associated tools that the 16 bit immediate values in some instructions
had the potential to get us into the same ugly mess that the 80x86 camp
is in, with multiple memory "models" directly visible to the programmer.
We wanted to hide this as much as possible, so that those programming in
a high level language, and to some extent those programming in
assembler, could simply treat the machine as if it had a linear 32-bit
address space, with no special contortions necessary for access to any
particular object, no matter how large.

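
As an aside, the hi16/lo16 split in the quoted two-instruction sequence
can be sketched in C.  This is only an illustration, not part of the
tool set: hi16() and lo16() mirror the assembler pseudofunctions, while
effective_address() and synthesize() are made-up helpers.  It assumes
the 16-bit offset in a load/store is treated as unsigned, which is what
the quoted sequence implies.

```c
#include <assert.h>
#include <stdint.h>

/* Upper and lower halves of a 32-bit address, named after the
   hi16()/lo16() pseudofunctions in the quoted assembly. */
static uint16_t hi16(uint32_t addr) { return (uint16_t)(addr >> 16); }
static uint16_t lo16(uint32_t addr) { return (uint16_t)(addr & 0xFFFFu); }

/* Hypothetical model of the address reached by
       or.u rB,r0,hi     ; rB = hi << 16 (r0 reads as zero)
       ld   rD,rB,lo     ; access rB + lo
   ASSUMPTION: lo is zero-extended, not sign-extended. */
static uint32_t effective_address(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) + lo;
}

/* Hypothetical round trip: split an address and put it back together.
   The assert documents that the split loses nothing. */
static uint32_t synthesize(uint32_t addr)
{
    uint32_t ea = effective_address(hi16(addr), lo16(addr));
    assert(ea == addr);
    return ea;
}
```

For example, hi16(0x00430028) gives 0x0043 and lo16(0x00430028) gives
0x0028, so the or.u/ld pair reaches the full 32-bit address.
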
To demonstrate one of the features of the tool set that accomplishes
this goal, consider the following example program:

    char array[(4 * 64 * 1024) + 1];

    main ()
    {
        array[0 * 64 * 1024] = 1;
        array[1 * 64 * 1024] = 1;
        array[2 * 64 * 1024] = 1;
        array[3 * 64 * 1024] = 1;
        array[4 * 64 * 1024] = 1;
    }

The compiler produces the following assembly code (with comments
stripped by hand to save some space):

            global  _main
            text
    _main:
            addu    r20,r0,1
            st.b    r20,r0,_array
            st.b    r20,r0,_array+65536
            st.b    r20,r0,_array+131072
            st.b    r20,r0,_array+196608
            st.b    r20,r0,_array+262144
            jmp     r1
            data
            comm    _array,262145

Note the lack of any hi16/lo16 pseudofunctions.  The compiler just emits
the straightforward, obvious code.

Note that the assembler does not do any particular magic with this code
either.  Any expressions that do not evaluate to a constant small enough
to fit into the allocated slot in the object code are simply passed on
to the linker for evaluation.  Below is a disassembly of the relevant
section of the .o file produced by the above assembly code:

    _main        62800001   addu  r20,r0,$0001
    $00000004    2E800000   st.b  r20,r0,$0000
    $00000008    2E800000   st.b  r20,r0,$0000
    $0000000C    2E800000   st.b  r20,r0,$0000
    $00000010    2E800000   st.b  r20,r0,$0000
    $00000014    2E800000   st.b  r20,r0,$0000
    $00000018    F400C001   jmp   r1 (_main)

Now is where the interesting stuff starts.  The linker is allocated
registers r26-r29 to use in any way it sees fit.  By convention, the
linker is also guaranteed that no user code will ever play with these
registers.  For the example above, the linker decides that its most
efficient use of the registers, based on the final address of the data
section and some other factors, is to segment the data section into
three 64K segments, followed by an "infinite" length segment.
The first three registers, r26, r27, and r28, are set up as base
pointers to these first three segments, and the last linker register,
r29, is reserved for synthesizing 32-bit addresses into the remaining
"infinite" length segment.  Thus, in effect, r29 becomes a dynamically
changing base pointer that gets changed on an instruction by instruction
basis to point to the 64K data segment containing the referenced object.

When the linker does its work, it actually patches the object code,
changing register assignments and inserting instructions as necessary,
to produce the following code, which ultimately gets executed:

    _main        62800001   addu  r20,r0,$0001
    _main+$4     2E9A0028   st.b  r20,r26,$0028
    _main+$8     2E9B0028   st.b  r20,r27,$0028
    _main+$C     2E9C0028   st.b  r20,r28,$0028
    _main+$10    5FA00043   or.u  r29,r0,$0043
    _main+$14    2E9D0028   st.b  r20,r29,$0028
    _main+$18    5FA00044   or.u  r29,r0,$0044
    _main+$1C    2E9D0028   st.b  r20,r29,$0028

Note that the data section for this sample starts at 0x400000.  The
$0028 offset comes from the fact that crt0.o contains $0028 worth of
data that gets linked before our test array, i.e. the address of _array
ends up being 0x400028.

With this strategy, we have the best of both worlds.  Loads and stores
to objects low in the data space use the more efficient single
instruction form, while loads and stores to objects far into the data
space use the two instruction form, and all of this is completely
transparent to the programmer.  He did not have to decide in advance
whether to use a "small model" or "huge model" for his program.

This is just the tip of the iceberg; there are lots of other
optimizations that become obvious.  By examining the static and dynamic
characteristics of the program, the data section objects can be sorted
to get the most frequently used objects into low data memory.
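
The base-register scheme described above can be sketched in C as
follows.  This is a rough model, not the linker's actual code:
patch_store_byte(), encode_imm() and the constants are hypothetical
names, the instruction field layout and opcodes are simply read off the
hex words in the listings, and the data base is assumed to be
64K-aligned.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTIONS for this sketch: three static 64K segments based at
   r26-r28, with r29 as the scratch base, as in the example above. */
enum { SEG_SIZE = 0x10000, FIRST_BASE_REG = 26, SCRATCH_REG = 29,
       STATIC_SEGS = 3 };

/* Opcodes inferred from the listings (2E9A0028 and 5FA00043). */
#define OP_ST_B 0x0Bu   /* 001011 */
#define OP_OR_U 0x17u   /* 010111 */

/* Field layout inferred from the hex words:
   opcode[31:26] rD[25:21] rS1[20:16] imm16[15:0] */
static uint32_t encode_imm(uint32_t opcode6, unsigned rd, unsigned rs1,
                           uint16_t imm)
{
    return (opcode6 << 26) | ((uint32_t)rd << 21) |
           ((uint32_t)rs1 << 16) | imm;
}

/* Rewrite "st.b rsrc,r0,addr" into one or two instruction words.
   Returns the number of words written to out[]. */
static int patch_store_byte(uint32_t data_base, uint32_t addr,
                            unsigned rsrc, uint32_t out[2])
{
    assert(addr >= data_base && (data_base & 0xFFFFu) == 0);

    uint32_t seg = (addr - data_base) / SEG_SIZE;
    uint16_t off = (uint16_t)(addr & 0xFFFFu);

    if (seg < STATIC_SEGS) {
        /* Low data: a preloaded base register covers this segment. */
        out[0] = encode_imm(OP_ST_B, rsrc, FIRST_BASE_REG + seg, off);
        return 1;
    }
    /* Far data: build the upper half in the scratch register first. */
    out[0] = encode_imm(OP_OR_U, SCRATCH_REG, 0, (uint16_t)(addr >> 16));
    out[1] = encode_imm(OP_ST_B, rsrc, SCRATCH_REG, off);
    return 2;
}
```

With a data base of 0x00400000 (the value implied by the
or.u r29,r0,$0043 in the patched listing), a store of r20 to 0x00400028
yields the single word 2E9A0028, and a store to 0x00430028 yields the
pair 5FA00043, 2E9D0028, matching the listing.
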
The linker might also decide that certain sections of the program
reference portions of data memory more often than others, and insert
the appropriate code to change the data mapping on the fly, rather than
using a static mapping.

One loose end in our example needs to be tied up.  How do r26, r27, and
r28 get initialized?  The answer lies in crt0, where the linker patches
a section of code to initialize any registers it uses:

    __start        5F400040   or.u  r26,r0,$0040
    __start+$4     5B5A0000   or    r26,r26,$0000
    __start+$8     5F600041   or.u  r27,r0,$0041
    __start+$C     5B7B0000   or    r27,r27,$0000
    __start+$10    5F800042   or.u  r28,r0,$0042
    __start+$14    5B9C0000   or    r28,r28,$0000
    __start+$18    5FA00000   or.u  r29,r0,$0000
    __start+$1C    5BBD0000   or    r29,r29,$0000

I hope you have found this little example interesting.  I should note
that the general idea of having the linker synthesize necessary
instruction streams to hide the 16-bit literal constant problem was
first proposed to me by a long-time Motorolan architecture expert,
Bob Greiner.

-Fred
--
# Fred Fish    hao!noao!mcdsun!fnf    (602) 438-3614
# Motorola Computer Division, 2900 S. Diablo Way, Tempe, Az 85282  USA