Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Endian reversing MOVEs
Message-ID: <13274@winchester.mips.COM>
Date: 14 Feb 89 23:20:20 GMT
References: <759@atanasoff.cs.iastate.edu> <772@atanasoff.cs.iastate.edu>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 81

In article zs01+@andrew.cmu.edu (Zalman Stern) writes:
>For comparison purposes I've jotted down code for the RT, the AMD 29000, and the
>MIPS R3000. These all assume you are byte switching a value from one register
>into a distinct register. This is a minor difference from the original macro,
....
>Finally, the R3000 routine, which is fairly generic and could be implemented on
>other architectures:
>    andi  temp, SRC, 0x00ff
>    sll   DEST, temp, 24
>    andi  temp, SRC, 0xff00
>    sll   temp, temp, 8
>    and   DEST, temp, DEST
>    lui   temp, 0x00ff
>    and   temp, SRC, temp
>    srl   temp, temp, 8
>    and   DEST, temp, DEST
>    lui   temp, 0xff00
>    and   temp, SRC, temp
>    srl   temp, temp, 24
>    and   temp, DEST, DEST
>
>13 single cycle instructions, one temporary register.

Following is a C program that does this, 2 ways:
    a) Value starts in memory
    b) Value starts in register
The relevant piece of the .s output is shown.

The point is NOT that an R3000 is faster than Zalman's example, which
was at least a credible try. The point is the relationship of hand-coded
routines to code compiled from high-level languages; RISC chips were
supposed to be designed for the latter.

Could people perhaps post the best compiler-generated code for this
function, and compare it with the best hand-coded code?
(Use any C code you like).
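For reference, the mask-and-shift idea behind the quoted R3000 routine can be written out in C roughly as below. This is only a sketch: the function name is hypothetical, and uint32_t (from <stdint.h>) is assumed for a 32-bit word. One thing it makes visible is that the four partial results occupy disjoint byte positions, so they have to be combined with OR; the quoted listing combines them with `and`, which would presumably leave zero.

```c
#include <stdint.h>

/* Hypothetical helper, not from the post: reverse the four bytes of a
 * 32-bit word by masking each byte and shifting it to its mirrored slot. */
uint32_t bswap32_mask_shift(uint32_t src)
{
    uint32_t dest;

    dest  = (src & 0x000000ffu) << 24;  /* byte 0 -> byte 3 */
    dest |= (src & 0x0000ff00u) << 8;   /* byte 1 -> byte 2 */
    dest |= (src & 0x00ff0000u) >> 8;   /* byte 2 -> byte 1 */
    dest |= (src & 0xff000000u) >> 24;  /* byte 3 -> byte 0 */
    return dest;
}
```

The two upper masks are the ones that do not fit a 16-bit immediate, which is why the quoted routine spends a `lui` on each of them.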
I'll start:

C PROGRAM:
struct x { unsigned char a, b, c, d};
unsigned y,z;
main(p)
struct x *p;
{
    y = p->d << 24 | p->c << 16 | p->b << 8 | p->a;
    z = (y << 24) | ((y & 0xff00) << 8) | ((y << 8) & 0xff00) | (y >> 24);
}

GENERATED .s:
 #   6    y = p->d << 24 | p->c << 16 | p->b << 8 | p->a;
    lbu   $14, 3($4)
    sll   $15, $14, 24
    lbu   $24, 2($4)
    sll   $25, $24, 16
    or    $8, $15, $25
    lbu   $9, 1($4)
    sll   $10, $9, 8
    or    $11, $8, $10
    lbu   $12, 0($4)
    or    $13, $11, $12
    sw    $13, y            * NOT COUNTED
 #   7    z = (y << 24) | ((y & 0xff00) << 8) | ((y << 8) & 0xff00) | (y >> 24);
    sll   $14, $13, 24
    and   $24, $13, 65280
    sll   $15, $24, 8
    or    $25, $14, $15
    sll   $9, $13, 8
    and   $8, $9, 65280
    or    $10, $25, $8
    srl   $11, $13, 24
    or    $12, $10, $11
    sw    $12, z            * NOT COUNTED

The first case is 10 cycles (+ cache miss, if any); the second case is
9 cycles, both assuming 100% I-cache hit. As noted, I coded the C
to take advantage of the fact that ands of 0xffff or less are good things.

BTW: this isn't something I'd expect an R3000 to do especially well on.
--
-john mashey    DISCLAIMER:
UUCP:    {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:     408-991-0253 or 408-720-1700, x253
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
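A note on the register case, offered as a sketch and not as a correction of the posted compiler output: as written, the third term of the z expression, (y << 8) & 0xff00, moves byte 0 up into byte 1, so the result is not a full byte reversal. Shifting right in that term, (y >> 8) & 0xff00, brings byte 2 down into byte 1 instead. The two variants are shown side by side below; both function names are hypothetical and uint32_t is assumed for the 32-bit unsigned.

```c
#include <stdint.h>

/* The expression for z exactly as posted (function name is hypothetical). */
uint32_t posted_z(uint32_t y)
{
    return (y << 24) | ((y & 0xff00u) << 8) | ((y << 8) & 0xff00u) | (y >> 24);
}

/* Same shape with the third term shifted right: a full byte reversal. */
uint32_t reversed_z(uint32_t y)
{
    return (y << 24) | ((y & 0xff00u) << 8) | ((y >> 8) & 0xff00u) | (y >> 24);
}
```

With y = 0x11223344, posted_z gives 0x44334411 and reversed_z gives 0x44332211. The change swaps one left shift for a right shift, so the instruction count for the second case should stay at 9, and both masks still fit in 16 bits.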