Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!cs.utexas.edu!sun-barr!newstop!exodus!rbbb.Eng.Sun.COM!chased
From: chased@rbbb.Eng.Sun.COM (David Chase)
Newsgroups: comp.arch
Subject: Re: Anything wrong with the i860
Message-ID: <13570@exodus.Eng.Sun.COM>
Date: 18 May 91 23:33:31 GMT
References: <1991May7.145407.18417@midway.uchicago.edu> <13008@pt.cs.cmu.edu> <3996@ssc-bee.ssc-vax.UUCP>
Sender: news@exodus.Eng.Sun.COM
Organization: Sun Microsystems, Mt. View, Ca.
Lines: 68

carroll@ssc-vax.UUCP (Jeff Carroll) writes:
>	Intel has also OEMed i860 compilers from Green Hills and the Portland
>Group, and struck an (unannounced) deal with Multiflow for compiler
>technology at about the same time they announced the Alliant deal.
>
>	I think that Alliant's failure to produce its compiler on schedule
>likely says more about Alliant than it does about the i860.

I don't think this is a valid criticism.  There were compilers for the
i860 some time ago; the question is whether the writers of those
compilers were attempting to generate code for (as Preston Briggs put
it) the i860's "sweet spot".  If so, the only thing they can be
faulted for is thinking they could be done by now.

The BEST code for the i860 is very hard to generate.  I expect that
when compilers are finally generating very good code for that chip,
they will do so using blocking techniques that ensure operands are
in the cache at known alignments (that is, blocks will be copied).
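
A minimal sketch of that copying step, in C.  The names `BLOCK`,
`align_up16` and `copy_tile` are hypothetical, and the tile size and
16-byte alignment are assumptions picked for illustration, not figures
from the i860 manual.  It only shows the idea: after the copy, the
operands are contiguous and at an alignment the generated code can
rely on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 16 };   /* hypothetical tile edge; assumed small enough to stay cached */

/* Round a pointer up to the next 16-byte boundary.  The caller must have
   allocated at least 2 spare doubles so the result stays inside the buffer. */
double *align_up16(double *p)
{
    return (double *)(((uintptr_t)p + 15u) & ~(uintptr_t)15u);
}

/* Copy the BLOCK x BLOCK tile whose top-left corner is (row0, col0) out of
   the row-major matrix `src` (leading dimension `ld`) into `tile`.
   Whatever the layout of `src`, the copy is contiguous and 16-byte aligned. */
void copy_tile(double *tile, const double *src, size_t ld,
               size_t row0, size_t col0)
{
    assert((uintptr_t)tile % 16 == 0);   /* the alignment the copy is meant to provide */
    for (size_t i = 0; i < BLOCK; i++)
        for (size_t j = 0; j < BLOCK; j++)
            tile[i * BLOCK + j] = src[(row0 + i) * ld + (col0 + j)];
}
```

The cost of the copy is then amortized over however many times the
tile is reused before it is evicted.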

It is also possible in good situations to generate code that does not
require this copying, but I've only seen it done by hand.  I tried
some of it myself (taking into account all the rules about stalled
instructions in the reference manual) and managed to write some code
for matrix multiply (not the transposed version in the manual) that
would apparently hit "full speed" for any matrix where a single row
would fit in the cache.  It was very hard.

The difficulty comes from having to get everything right
simultaneously -- in the triply nested loop, you unroll an outer loop
by three, jam it into the innermost loop, unroll that by four,
recognize that you can accumulate three inner products simultaneously
in the adder pipeline (but that's because you know to select the right
instruction), and load the proper operands from the cache, load the
other operands using the cache-bypassing pipelined load instructions,
and know that everything is correctly aligned so that multi-word loads
can be used.
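
The loop restructuring alone (not the instruction selection, pipeline
use, or cache-bypassing loads) can be sketched in C.  This is an
illustration under stated assumptions, not the hand-written code
described above: the function name `matmul_3x4` is hypothetical, the
choice of which outer loop to unroll is a guess, and n is assumed to
be a multiple of 12 so that no cleanup loops are needed.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for n x n row-major matrices.
   The i loop is unrolled by three and jammed into the k loop, which is
   itself unrolled by four, so three inner products accumulate at once. */
void matmul_3x4(double *C, const double *A, const double *B, size_t n)
{
    assert(n % 12 == 0);                       /* assumption: no cleanup loops */
    for (size_t i = 0; i < n; i += 3) {        /* outer loop, unrolled by three */
        const double *a0 = &A[(i + 0) * n];
        const double *a1 = &A[(i + 1) * n];
        const double *a2 = &A[(i + 2) * n];
        for (size_t j = 0; j < n; j++) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0;   /* three running sums */
            for (size_t k = 0; k < n; k += 4) {    /* inner loop, unrolled by four */
                /* Each element of B is loaded once and used three times. */
                const double b0 = B[(k + 0) * n + j];
                const double b1 = B[(k + 1) * n + j];
                const double b2 = B[(k + 2) * n + j];
                const double b3 = B[(k + 3) * n + j];
                s0 = s0 + a0[k] * b0 + a0[k + 1] * b1 + a0[k + 2] * b2 + a0[k + 3] * b3;
                s1 = s1 + a1[k] * b0 + a1[k + 1] * b1 + a1[k + 2] * b2 + a1[k + 3] * b3;
                s2 = s2 + a2[k] * b0 + a2[k + 1] * b1 + a2[k + 2] * b2 + a2[k + 3] * b3;
            }
            C[(i + 0) * n + j] = s0;
            C[(i + 1) * n + j] = s1;
            C[(i + 2) * n + j] = s2;
        }
    }
}
```

Nothing in C expresses the remaining requirements (which operands sit
in the cache, which loads bypass it, which multiply-add instruction is
chosen); those are the parts that have to be right at the same time.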

If you didn't unroll by three, or if you selected the wrong
instruction, or if you didn't do the registers-in-the-pipeline trick,
or if you didn't get the proper assignment of operands to cache and main
memory, or if the alignments weren't right, then the performance drops
precipitously.

For other operations (e.g., the elementary row operations of linear
algebra) you pretty much have to reorganize operands so that
everything is in the cache.  If you don't, the bottleneck is
architectural.  If you assume that B is cached but A is not (i.e., we
expect to eliminate B from many rows, so we put it in the cache) in
   A[i] = A[i] + F * B[i]

you end up with 64 bits of off-chip I/O per cycle.  This is the max,
and the chip could do it, except that the instruction grouping rules
say, "one from column F, one from column I".  If you have N Mpy-adds,
that's N instructions to schlep A around, but you still need N/4 more
to load B from the cache (using quad-word loads that are not available
in the pipelined form) plus one to control the loop.
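
The arithmetic can be written out as a small C sketch.  The function
names are hypothetical, and it rests on one reading of the grouping
rule: in dual-instruction mode each cycle issues at most one
floating-point instruction ("column F") and one core instruction
("column I").  Rounding N/4 up for lengths that are not a multiple of
four is an added assumption.

```c
#include <assert.h>
#include <stddef.h>

/* The row operation from the text: A[i] = A[i] + F * B[i]. */
void row_axpy(double *A, const double *B, double F, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[i] = A[i] + F * B[i];
}

/* Core-side instructions for n multiply-adds, following the count in the
   text: n to move A, n/4 quad-word loads of B (rounded up), 1 for the loop. */
size_t core_instructions(size_t n)
{
    return n + (n + 3) / 4 + 1;
}

/* Floating-point instructions for the same work: one per multiply-add. */
size_t fp_instructions(size_t n)
{
    return n;
}
```

For n = 64 this gives 81 core instructions against 64 floating-point
ones, so if only one of each can issue per cycle, the core side sets
the pace and the multiply-add unit is busy for at most about 79% of
the cycles (64/81).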

Of course, it goes w/o saying that you've already done dependence
analysis on everything, and that you are running the chip in
dual-instruction mode.  I haven't even begun to talk about the
details, so you begin to see what a "challenge" this chip is.

Another approach would be to use pattern-matching (big patterns, too)
and just call canned routines written by hand.

David Chase
Sun