Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!amdcad!rpw3
From: rpw3@amdcad.AMD.COM (Rob Warnock)
Newsgroups: comp.arch
Subject: Re: Endian reversing MOVEs
Message-ID: <24618@amdcad.AMD.COM>
Date: 25 Feb 89 08:42:16 GMT
References: <759@atanasoff.cs.iastate.edu> <772@atanasoff.cs.iastate.edu>
Reply-To: rpw3@amdcad.amd.com (Rob Warnock)
Organization: [Consultant] San Mateo, CA
Lines: 70

Took me a little while to get around to it, but below is a fast (4 cycle)
byte-swap for the Am29000, to round out the mix. And here's what I've seen
to date:

	machine  who reported   language   cycles  notes
	=======  ============   ========   ======  =====
	29k      rpw3@amd       C          10
	MIPS     mashey@mips    C           9
	ARM      rwilson@acorn  C           7
	VAX      bimadre@kulcs  assembler   5      Instr., not cycles!
	88k      klossner@tek   assembler   7      (2+5n, maybe?)
	ARM      rwilson@acorn  assembler   4      1+3n
	29k      rpw3@amd       assembler   4      1+3n

The "1+3n" for ARM & 29k means that the work done by the first instruction
is not destroyed by the remainder, thus blocks of words can be swapped in
3 cycles, asymptotically. I also think the 88k code could do it in 2+5n if
the 0xFF00FF00 mask were generated explicitly. (But does it matter? Naaah!)

	; byte-swap for Am29000: src = a, b, c, d
	mtsrim	ALU, (1<<5)+24	; set BP=1, FC=24
	extract	tmp, src, src	; tmp = d, a, b, c  (really a rotate)
	exbyte	dst, tmp, tmp	; dst = d, a, b, a
	inbyte	dst, dst, tmp	; dst = d, c, b, a

The 4-cycle method for the 29k thus bears some resemblance to the original
poster's version which did the swap in-memory. In this case, we use the 29k
instructions that support byte load/store (on the 29k you do a word load,
then an EXBYTE.) Normally, the second arg of EXBYTE is an immediate 0, but
here we *use* the fact that EXBYTE is really a specialized merge. In
addition, we use the EXTRACT both-sources-same idiom to get a rotate.
Finally, one instruction is saved by setting up both the Byte Pointer and
the Funnel-shifter Count by loading the entire ALU status reg, rather than
explicit loads of the equivalent BP and FC regs. (Makes one almost wish for
an explicit ROTATE, rather than the 2-cycle "set FC, extract" idiom. But
rotates of varying values probably aren't worth it, and the 29k can do
several rotates of the same amount in 1+1n.)

The C version was compiled with the Metaware C compiler with "-O" turned on
(and a lot of compiler noise elided):

	unsigned foo(x)
	unsigned int x;
	{
		return (x << 24) | ((x & 0xff00) << 8) |
		       ((x >> 8) & 0xff00) | (x >> 24);
	}

	_foo:
		sll	gr120,lr2,24
		const	gr119,65280	; (0xff00)
		and	gr121,lr2,gr119
		sll	gr121,gr121,8
		or	gr120,gr121,gr120
		srl	gr121,lr2,8
		and	gr121,gr121,gr119
		or	gr120,gr121,gr120
		srl	gr121,lr2,24
		jmpi	lr0
		or	gr96,gr121,gr120

-----------
Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA 94403
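For readers without a 29k at hand, the four-instruction sequence above can
be traced in portable C. This is only a rough sketch, not a definitive model
of the hardware: the helper names (extract_same, exbyte, inbyte,
swap_29k_model) are made up for illustration, and the semantics are taken
from the comments in the assembler listing, assuming that byte 0 is the most
significant byte of the word and that EXTRACT with both sources equal acts
as a rotate left by FC bits.

```c
#include <assert.h>
#include <stdint.h>

/* EXTRACT with both sources equal: funnel-shift the pair src:src left by
   fc bits and keep the high word, which amounts to a rotate left by fc. */
static uint32_t extract_same(uint32_t src, unsigned fc)
{
    fc &= 31;
    return fc ? (src << fc) | (src >> (32 - fc)) : src;
}

/* EXBYTE (as modeled here): the byte of srca selected by bp replaces the
   low-order byte of srcb.  Byte 0 is assumed to be the most significant. */
static uint32_t exbyte(uint32_t srca, uint32_t srcb, unsigned bp)
{
    uint32_t byte = (srca >> (8 * (3 - bp))) & 0xffu;
    return (srcb & 0xffffff00u) | byte;
}

/* INBYTE (as modeled here): the low-order byte of srcb replaces the byte
   of srca selected by bp. */
static uint32_t inbyte(uint32_t srca, uint32_t srcb, unsigned bp)
{
    unsigned shift = 8 * (3 - bp);
    return (srca & ~((uint32_t)0xffu << shift)) | ((srcb & 0xffu) << shift);
}

/* The four-instruction sequence, with src = a, b, c, d. */
uint32_t swap_29k_model(uint32_t src)
{
    const unsigned bp = 1, fc = 24;        /* mtsrim ALU, (1<<5)+24       */
    uint32_t tmp = extract_same(src, fc);  /* tmp = d, a, b, c            */
    uint32_t dst = exbyte(tmp, tmp, bp);   /* dst = d, a, b, a            */
    return inbyte(dst, tmp, bp);           /* dst = d, c, b, a            */
}

/* The shift-and-mask expression from the C version of foo(). */
uint32_t swap_shift_mask(uint32_t x)
{
    return (x << 24) | ((x & 0xff00u) << 8) |
           ((x >> 8) & 0xff00u) | (x >> 24);
}

/* Cross-check the two routines on a few sample words. */
void check_swap_models(void)
{
    static const uint32_t samples[] = {
        0x00000000u, 0x11223344u, 0xdeadbeefu, 0xffffffffu, 0x000000ffu
    };
    unsigned i;
    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(swap_29k_model(samples[i]) == swap_shift_mask(samples[i]));
}
```

Calling check_swap_models() from a main() exercises both routines. With
src = 0x11223344 the intermediate values under this model are
tmp = 0x44112233 and dst = 0x44112211 after the EXBYTE step, matching the
"d, a, b, c" and "d, a, b, a" comments in the listing.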