Path: utzoo!attcan!uunet!mailrus!uflorida!novavax!hcx1!tom
From: tom@ssd.csd.harris.com (Tom Horsley)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Message-ID:
Date: 12 Jan 90 12:25:11 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <648@s5.Morgan.COM> <1879@xyzzy.UUCP> <2811@yogi.oakhill.UUCP>
Sender: news@hcx1.UUCP
Organization: Harris Computer Systems Division
Lines: 81
In-reply-to: marvin@oakhill.UUCP's message of 11 Jan 90 00:55:42 GMT

In article <2811@yogi.oakhill.UUCP> marvin@oakhill.UUCP (Marvin Denman) writes:
>The example in question was obviously for single precision.

The original article specifically stated that the example was double
precision; that is why I wondered where the numbers came from.

> One clock could probably be saved in this case by optimizing the
>loop to use bcnd instead of the compare and branch sequence.

Maybe, but I got this code by assuming that I could do induction variable
elimination and test replacement. In order to use bcnd, I need to count
down to zero, which probably means adding in an extra subu, thus eating
the cycle I just saved. Perhaps a sufficiently clever compiler could get
around this. In any event, neither 67 nor 68 is close to 48.

>Data dependencies between iterations of a loop are a very significant
>problem with unrolling loops. Hopefully the compiler will recognize the
>nondependencies well enough to unroll most loops that can be unrolled.
>I agree that on some loops there are dependencies that hinder unrolling.
>If these can be identified though the compiler may even be able to
>remove redundant loads. There is so much room for improvement that I
>find it difficult to be pessimistic about the amount of improvement that
>is possible.

There is no question that compilers can generate better code than they do now.
We are currently at the stage of doing a detailed examination of the code
quality of our own 88k compilers here at Harris Computer Systems, and we
are often horrified by some of the truly rotten code we produce. We ARE
fixing these problems. (And occasionally we are uplifted by the terrific
code we produce.)

However, there is a real problem with loop unrolling that depends on
language semantics. In FORTRAN compilers it may well be possible to
profitably unroll many loops, due to some of the aliasing restrictions
that the FORTRAN standard imposes on arguments. In the long term in Ada,
it is also possible, because Ada requires a global program database which
could someday be used to do the sorts of interprocedural analysis required
to determine that no aliasing occurs.

But on U**x systems, most code is written in C; increasingly, even
numerical code is written in C. And C pointers can point pretty much
anywhere, so compilers generally have to make worst case assumptions.
This means that in any loop like the one in the original example, where
there is a load through a pointer on the right of the statement and a
store through a pointer on the left, the compiler will be forced to
assume that the store must take place before the next loop iteration does
a load. Even if you unroll the loop, this data dependence will still be
in place.

Unfortunately, the only way you can get the example loop fully pipelined
is to do several multiplies and adds before actually storing the result.
In this case, if the algorithm were coded in C, you could take almost no
advantage of pipelining; the only thing unrolling would get you is a
slight improvement in the loop overhead (incrementing and testing the
induction variable).

>I disagree. I think that unless the latency is very short (2 or maybe 3
>cycles) that pipelining will pay off on a normal application mix.
>Marvin Denman
>Motorola 88000 Design
>cs.utexas.edu!oakhill!marvin

Of course you disagree, you work for Motorola :-)

Actually, I didn't mean to imply that I thought pipelining was a bad idea.
I am all in favor of it, because when you can take advantage of it, it
does a super job. I just wish that it didn't take so many clocks to get
through the pipe, because when it does not work out so well you just have
to eat the cycles and like it. In those cases I would prefer to eat as
few cycles as possible. To paraphrase your comment about MIPS, it will be
interesting to see if Motorola goes to fewer clocks for float instructions
in the next generation chips.

I still maintain that a large amount of real code (not artificial
benchmarks) contains data dependencies that force serial computation. I
would like this code to run fast as well.
--
=====================================================================
domain: tahorsley@ssd.csd.harris.com  USMail: Tom Horsley
  uucp: ...!novavax!hcx1!tahorsley            511 Kingbird Circle
      or  ...!uunet!hcx1!tahorsley            Delray Beach, FL 33444
======================== Aging: Just say no! ========================
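
Illustration of the C aliasing point made in the post: a minimal sketch,
not the loop from the original article (which is not reproduced here).
The function names, the loop body, and the test data are all hypothetical.
It shows why a compiler that cannot prove two pointers are distinct may
not hoist loads above stores when it unrolls: if the arrays overlap, the
hoisted version computes something different.

```c
#include <stddef.h>

/* Hypothetical stand-in for the loop under discussion: a load through
   one pointer on the right, a store through another on the left. */
void scale_add(double *y, const double *x, double a, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

/* Unrolled by two, with both loads of x hoisted above both stores to y
   so the multiplies could overlap in the pipeline. Assumption: this is
   only equivalent to scale_add() when x and y do not overlap. */
void scale_add_hoisted(double *y, const double *x, double a, size_t n)
{
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        double x0 = x[i], x1 = x[i + 1];
        double y0 = y[i], y1 = y[i + 1];
        y[i] = y0 + a * x0;
        y[i + 1] = y1 + a * x1;
    }
    for (; i < n; i++)          /* leftover element when n is odd */
        y[i] = y[i] + a * x[i];
}

/* Returns 1 if the two versions disagree when y is x shifted by one
   element, so that each store feeds the next iteration's load. */
int overlap_changes_result(void)
{
    double b1[5] = {1, 1, 1, 1, 1};
    double b2[5] = {1, 1, 1, 1, 1};
    size_t i;

    scale_add(b1 + 1, b1, 1.0, 4);
    scale_add_hoisted(b2 + 1, b2, 1.0, 4);
    for (i = 0; i < 5; i++)
        if (b1[i] != b2[i])
            return 1;
    return 0;
}
```

With the overlapping buffers above, the straightforward loop leaves
1 2 3 4 5 in the array while the hoisted version leaves 1 2 2 3 2. A C
compiler that sees only the two pointer arguments has to allow for that
overlap, so it must keep each store ahead of the following load.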