Newsgroups: comp.sys.amiga.programmer Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!think.com!mintaka!geech.gnu.ai.mit.edu!rjc From: rjc@geech.gnu.ai.mit.edu (Ray Cromwell) Subject: Re: Lemmings - a tutorial Part V (last) Message-ID: <1991Apr2.002631.22799@mintaka.lcs.mit.edu> Sender: daemon@mintaka.lcs.mit.edu (Lucifer Maleficius) Organization: The Internet References: <1991Apr1.020748.26863@mintaka.lcs.mit.edu> Date: Tue, 2 Apr 91 00:26:31 GMT Lines: 136 In article mykes@amiga0.SF-Bay.ORG (Mike Schwartz) writes: >In article <1991Apr1.020748.26863@mintaka.lcs.mit.edu> rjc@geech.gnu.ai.mit.edu (Ray Cromwell) writes: > >> Obviously you have no idea of how advanced today's optimizing compilers >>are. The code you stepped through must have been produced by some >>1970's MetaComco compiler or something. But FYI, most of todays compilers >>can pass arguements in registers, allocate memory without stack, eliminate >>the frame registers, and even do all the non-obvious tricks of sign extension, >>etc. Check out the code I've included at the end compiled by GCC. >> >>The following was compiled with GCC -O -fstrength-reduce -fomit-frame-register >>I don't have SAS C on the Amiga, but I'm sure it produces simular results. >> >>/* test.c */ >> >>char buf[20]; >>main() >>{ >> char *d=(char *)&buf; >> const char *s="This is a test\n"; >> while(*s) { *d++=*s++; } >>} >> >>/* Test.s produced by gcc */ >> >>#NO_APP >>gcc_compiled.: >>.text >>LC0: >> .ascii "This is a test\12\0" >> .even >>.globl _main >>_main: >> lea _buf,a1 >> lea LC0,a0 >> tstb a0@ >> jeq L5 >>L4: >> moveb a0@+,a1@+ >> tstb a0@ >> jne L4 >>L5: >> rts >>.comm _buf,20 >> > >/* Test.s produced by gcc */ > >#NO_APP >gcc_compiled.: >.text >LC0: > .ascii "This is a test\12\0" > .even >.globl _main >_main: ; (cycles) > lea _buf,a1 ; 8 > lea LC0,a0 ; 8 > tstb a0@ ; 8 > jeq L5 ; 8 >L4: > moveb a0@+,a1@+ ; 14*12 > tstb a0@ ; 14*8 > jne L4 ; 13*10+1*8 >L5: > rts ; 16 >.comm _buf,20 > > >;/* test.s produced by me */ ; (cycles) > lea text(pc),a0 ; 8 > lea buf(pc),a1 ; 8 BUG! This is not the same algorithm in the C code. Your code would move a NULL byte if a NULL string was passed when it should do nothing. I don't know why GCC doesn't generate PC relative instructions, but I think it has to do with UNIX and perhaps the scatter load/memory partitioning. >.loop move.b (a0)+,(a1)+ ; 14*12 > bne.s .loop ; 13*10+1*8 > rts ; 16 >text dc.b 'This is a test',10,0 >buf ds.b 20 > >NO COMPARISON DUDE! GCC makes 3 totally wasted instructions, and one of >them is inside your loop. Try an example with nested loops and your >wasted clock cycles become a geometric progression. Multiply the kind >of inefficiencies that GCC demonstrates here by EVERY loop and EVERY >function you have and your program is slower and bigger than it needs >to be. To be specific, the GCC routine is 6 words longer and by the >time it is done executing, it will take 128 more clock cycles than >mine will (on a 68000). Your routine takes 466 total clocks to execute, >mine takes 338. I'm just your average 68000 assembler language programmer, >but I saved 28% CPU time. You might also note that your 'C' source is >7 lines of code and so is my assembler code. This code is compiled to run on a 68030 @ 50mhz, the difference in speed would be _very_ small. Hey, someone compile this on SAS/C with ALL optimizations on. You're just your average assembly language programmer yet you introduced a subtle bug into the program that will stomp on the first byte of a static global string by attempting to copy a null string. In a HUGE program this bug would be so subtle that it may take days to find out how a memory location is being sporatically trashed. Worse yet, a null string will have garbage following it that could span hundreds of bytes. Your assembler code has the C equivelent of 'do { *d++=*s++; } while(*s);' >Just think, the OS ROM routines are written in 'C' and compiled with >a lesser compiler than gcc (5 years ago). Just think, OS2.0 is compiled with SAS/C. > > >> Also, it's not the language but the algorithm that is responsible for >>how fast a routine runs. Compare a C coded Boyer-Moore string search >>with an Assembly coded brute-force byte by byte search, the C code would >>probably win without optimizations turned on. >> > >You should compare apples to apples. Compare your Boyer-Moore string search >in assembler language with the one in 'C' on the Amiga. How about comparing >an assembler coded Boyer-Moore string search against a 'C' coded brute-force >byte by byte search? You'd complain it's not fair either. The point is, OPTIMIZATIONS only account for a marginal linear speed improvement while choosing the right algorithm could result in an order or two of magnitude improvement. Matt's assembler is perfect proof. If Matt coded it in assembly and added some features it would blow away DevPac and probably beat ArgAsm. -- /~\_______________________________________________________________________/~\ |n| rjc@albert.ai.mit.edu Amiga, the computer for the creative mind. |n| |~| .-. .-. |~| |_|________________________________| |_| |________________________________|_|