Path: utzoo!attcan!uunet!mailrus!uflorida!novavax!hcx1!tom
From: tom@ssd.csd.harris.com (Tom Horsley)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Message-ID: <TOM.90Jan12072511@hcx2.ssd.csd.harris.com>
Date: 12 Jan 90 12:25:11 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu>
	<648@s5.Morgan.COM> <1879@xyzzy.UUCP>
	<TOM.90Jan9101628@hcx2.ssd.csd.harris.com> <2811@yogi.oakhill.UUCP>
Sender: news@hcx1.UUCP
Organization: Harris Computer Systems Division
Lines: 81
In-reply-to: marvin@oakhill.UUCP's message of 11 Jan 90 00:55:42 GMT

In article <2811@yogi.oakhill.UUCP> marvin@oakhill.UUCP (Marvin Denman) writes:

   >The example in question was obviously for single precision.

The original article specifically stated that the example was double precision;
that is why I wondered where the numbers came from.

   >            One clock could probably be saved in this case by optimizing the
   >loop to use bcnd instead of the compare and branch sequence.

Maybe, but I got this code by assuming that I could do induction variable
elimination and test replacement. In order to use bcnd, I need to count
down to zero, which probably means adding in an extra subu, thus eating
the cycle I just saved. Perhaps a sufficiently clever compiler could get
around this. In any event, neither 67 nor 68 is close to 48.
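
To make the trade-off concrete, here is a minimal C sketch of the two loop
shapes. The loop body (y[i] = y[i] + a * x[i]) and the function names are
hypothetical stand-ins, not the code from the original article, and the
comments only indicate roughly which instructions each form implies.

```c
#include <stddef.h>

/* Count-up form after induction variable elimination and test
 * replacement: the index is gone, and the exit test compares the
 * moving pointer against a precomputed end pointer (a compare and
 * branch at the bottom of the loop). */
void axpy_test_replaced(double *y, const double *x, double a, size_t n)
{
    const double *x_end = x + n;
    while (x != x_end) {
        *y = *y + a * *x;
        x++;
        y++;
    }
}

/* Count-down form: the exit test is a plain "is it zero?" check, the
 * kind of test bcnd can do by itself, but the counter is an extra
 * variable that needs its own decrement (the extra subu). */
void axpy_count_down(double *y, const double *x, double a, size_t n)
{
    size_t remaining = n;
    while (remaining != 0) {
        *y = *y + a * *x;
        x++;
        y++;
        remaining--;
    }
}
```

Both functions compute the same result; the only difference is whether the
loop pays for a compare or for a separate decrement.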

   >Data dependencies between iterations of a loop are a very significant
   >problem with unrolling loops.  Hopefully the compiler will recognize the
   >nondependencies well enough to unroll most loops that can be unrolled.
   >I agree that on some loops there are dependencies that hinder unrolling.
   >If these can be identified though the compiler may even be able to
   >remove redundant loads.  There is so much room for improvement that I
   >find it difficult to be pessimistic about the amount of improvement that
   >is possible.

There is no question that compilers can generate better code than they do
now. We are currently at the stage of doing a detailed examination of the
code quality of our own 88k compilers here at Harris Computer Systems, and
we are often horrified by some of the truly rotten code we produce. We ARE
fixing these problems. (And occasionally we are uplifted by the terrific
code we produce.)

However, there is a real problem with loop unrolling that depends on language
semantics. In FORTRAN compilers it may well be possible to profitably unroll
many loops, due to some of the aliasing restrictions that the FORTRAN standard
imposes on arguments. In the long term it is also possible in Ada, because
Ada requires a global program database which could someday be used to do the
sorts of interprocedural analysis required to determine that no aliasing
occurs. On U**x systems, however, most code is written in C, and increasingly
even numerical code is written in C. C pointers can point pretty much
anywhere, so compilers generally have to make worst-case assumptions. This
means that
in any loop like the one in the original example where there is a load
through a pointer on the right of the statement and a store through a pointer
on the left, the compiler will be forced to assume that the store must
take place before the next loop iteration does a load. Even if you unroll
the loop, this data dependence will still be in place.
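
A minimal C sketch of why the worst-case assumption is needed. The loop body
is a hypothetical stand-in for the one in the original example (which is not
reproduced in this article), and the function names are made up for
illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the loop under discussion: a load through
 * a pointer on the right, a store through a pointer on the left. */
void axpy(double *y, const double *x, double a, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

/* A perfectly legal call in which y is x shifted by one element, so
 * the value stored to y[i] is exactly the value the next iteration
 * loads as x[i + 1]. */
void overlapping_call(double buf[4])
{
    axpy(buf + 1, buf, 2.0, 3);
}
```

Starting from buf = {1, 0, 0, 0}, the overlapping call leaves {1, 2, 4, 8}:
each element is built from the one stored by the previous iteration. The
compiler sees only axpy and cannot rule this call out, so it has to keep
each store ahead of the following load.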

Unfortunately, the only way you can get the example loop fully pipelined is
to do several multiplies and adds before actually storing the result.  In
this case, if the algorithm were coded in C, you could take almost no
advantage of pipelining; the only thing unrolling would get you is a slight
improvement in the loop overhead (incrementing and testing the induction
variable).
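
The difference can be sketched in C with a loop unrolled by two. As before,
the loop body and the function names are hypothetical stand-ins, not the code
from the original article.

```c
#include <stddef.h>

/* Unrolled by two with the original load/store order kept. This is
 * all a compiler can do under worst-case aliasing assumptions: only
 * the loop overhead is halved. */
void axpy_unrolled_in_order(double *y, const double *x, double a, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        y[i] = y[i] + a * x[i];             /* this store ...          */
        y[i + 1] = y[i + 1] + a * x[i + 1]; /* ... precedes this load  */
    }
    if (i < n)
        y[i] = y[i] + a * x[i];
}

/* Loads and arithmetic for both iterations issued before either
 * store, the shape that can keep a pipelined unit busy. It matches
 * the version above only when x and y do not overlap. */
void axpy_unrolled_loads_first(double *y, const double *x, double a, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        double t0 = y[i] + a * x[i];
        double t1 = y[i + 1] + a * x[i + 1];
        y[i] = t0;
        y[i + 1] = t1;
    }
    if (i < n)
        y[i] = y[i] + a * x[i];
}
```

On separate arrays the two functions agree. Called with y = buf + 1 and
x = buf on {1, 0, 0, 0}, the in-order version produces {1, 2, 4, 8} while the
loads-first version produces {1, 2, 0, 0}, which is why the compiler is not
free to make that transformation on its own.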

   >I disagree.  I think that unless the latency is very short (2 or maybe 3
   >cycles) that pipelining will pay off on a normal application mix.

   >Marvin Denman
   >Motorola 88000 Design
   >cs.utexas.edu!oakhill!marvin

Of course you disagree, you work for Motorola :-)

Actually, I didn't mean to imply that I thought pipelining was a bad idea. I
am all in favor of it, because when you can take advantage of it, it does a
super job. I just wish that it didn't take so many clocks to get through the
pipe, because when it does not work out so well you just have to eat the
cycles and like it. In those cases I would prefer to eat as few cycles as
possible. To paraphrase your comment about MIPS, it will be interesting to
see if Motorola goes to fewer clocks for float instructions in the next
generation chips.

I still maintain that a large amount of real code (not artificial
benchmarks) contains data dependencies that force serial computation. I
would like this code to run fast as well.
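
One familiar illustration of such a dependence (chosen here as an example,
not taken from the earlier articles) is polynomial evaluation by Horner's
rule. The function name and argument order are assumptions for the sketch.

```c
#include <stddef.h>

/* Horner's rule; c[0] is the highest-order coefficient. Every
 * multiply-add consumes the acc produced by the previous one, so the
 * floating point operations form a serial chain and each one waits
 * out the full latency of the one before it, unrolled or not. */
double horner(const double *c, size_t n, double x)
{
    double acc = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        acc = acc * x + c[i];
    return acc;
}
```

With coefficients {1, 2, 3} and x = 2 this returns 11 (that is, 1*4 + 2*2 + 3),
computed as three strictly sequential multiply-adds.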
--
=====================================================================
domain: tahorsley@ssd.csd.harris.com  USMail: Tom Horsley
  uucp: ...!novavax!hcx1!tahorsley            511 Kingbird Circle
      or  ...!uunet!hcx1!tahorsley            Delray Beach, FL  33444
======================== Aging: Just say no! ========================