Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!cs.utexas.edu!sun-barr!newstop!exodus!rbbb.Eng.Sun.COM!chased
From: chased@rbbb.Eng.Sun.COM (David Chase)
Newsgroups: comp.arch
Subject: Re: Anything wrong with the i860
Message-ID: <13570@exodus.Eng.Sun.COM>
Date: 18 May 91 23:33:31 GMT
References: <1991May7.145407.18417@midway.uchicago.edu> <13008@pt.cs.cmu.edu> <3996@ssc-bee.ssc-vax.UUCP>
Sender: news@exodus.Eng.Sun.COM
Organization: Sun Microsystems, Mt. View, Ca.
Lines: 68

carroll@ssc-vax.UUCP (Jeff Carroll) writes:
> Intel has also OEMed i860 compilers from Green Hills and the Portland
>Group, and struck an (unannounced) deal with Multiflow for compiler
>technology at about the same time they announced the Alliant deal.
>
> I think that Alliant's failure to produce its compiler on schedule
>likely says more about Alliant than it does about the i860.

I don't think this is a valid criticism.  There were compilers for
the i860 some time ago; the question is whether the writers of those
compilers were attempting to generate code for (as Preston Briggs put
it) the i860's "sweet spot".  If so, the only thing they can be
faulted for is thinking they could be done by now.

The BEST code for the i860 is very hard to generate.  I expect that
when compilers are finally generating very good code for that chip,
they will do so by blocking techniques used to ensure that operands
are in the cache at known alignments (that is, blocks will be
copied).  It is also possible in good situations to generate code
that does not require this copying, but I've only seen it done by
hand.  I tried some of it myself (taking into account all the rules
about stalled instructions in the reference manual) and managed to
write some code for matrix multiply (not the transposed version in
the manual) that would apparently hit "full speed" for any matrix
where a single row would fit in the cache.  It was very hard.
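
[Illustration, not part of the original post.]  The blocking-with-
copying idea mentioned above can be sketched in portable C.  This is
only a rough sketch under stated assumptions: the tile size BLK, the
function names, and the row-major layout are hypothetical choices,
and nothing here is tuned for (or specific to) the i860.  It shows
just the shape of the technique: copy each tile into a small dense
buffer so that the inner loops touch only data at a known location
and stride.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLK 16  /* hypothetical tile edge; a real value depends on the cache */

/* Copy a rows x cols tile of src (row stride ld) into a dense buffer. */
static void copy_tile(const double *src, size_t ld,
                      size_t rows, size_t cols, double *dst)
{
    for (size_t r = 0; r < rows; r++)
        memcpy(dst + r * cols, src + r * ld, cols * sizeof(double));
}

/* C += A * B for n x n row-major matrices, processed tile by tile. */
void matmul_blocked_copy(size_t n, const double *A, const double *B,
                         double *C)
{
    static double a_tile[BLK * BLK];  /* dense copies of the current tiles */
    static double b_tile[BLK * BLK];

    assert(A && B && C);
    for (size_t ii = 0; ii < n; ii += BLK) {
        size_t ib = (n - ii < BLK) ? n - ii : BLK;
        for (size_t kk = 0; kk < n; kk += BLK) {
            size_t kb = (n - kk < BLK) ? n - kk : BLK;
            copy_tile(A + ii * n + kk, n, ib, kb, a_tile);
            for (size_t jj = 0; jj < n; jj += BLK) {
                size_t jb = (n - jj < BLK) ? n - jj : BLK;
                copy_tile(B + kk * n + jj, n, kb, jb, b_tile);
                /* Inner kernel reads only the copied tiles. */
                for (size_t i = 0; i < ib; i++)
                    for (size_t k = 0; k < kb; k++) {
                        double a = a_tile[i * kb + k];
                        for (size_t j = 0; j < jb; j++)
                            C[(ii + i) * n + jj + j] += a * b_tile[k * jb + j];
                    }
            }
        }
    }
}
```

The copying costs extra memory traffic, which is the price paid for
knowing where the operands sit; the hand-written approach described
in the post avoids that copy.
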

The difficulty comes from having to get everything right
simultaneously -- in the triply nested loop, you unroll an outer loop
by three, jam it into the innermost loop, unroll that by four,
recognize that you can accumulate three inner products simultaneously
in the adder pipeline (but that's because you know to select the
right instruction), load the proper operands from the cache, load the
other operands using the cache-bypassing pipelined load instructions,
and know that everything is correctly aligned so that multi-word
loads can be used.  If you didn't unroll by three, or if you selected
the wrong instruction, or if you didn't do the
registers-in-the-pipeline trick, or if you didn't get the proper
assignment of operands to cache and main memory, or if the alignments
weren't right, then the performance drops precipitously.

For other operations (e.g., elementary row operations of linear
algebra) you pretty much have to reorganize operands so that
everything is in the cache.  If not, the bottleneck is architectural;
if you assume that B is cached but A is not (i.e., we expect to
eliminate B from many rows, so we put it in the cache) in

	A[i] = A[i] + F * B[i]

you end up with 64 bits of off-chip I/O per cycle.  This is the max,
and the chip could do it, except that the instruction grouping rules
say, "one from column F, one from column I".  If you have N
multiply-adds, that's N instructions to schlep A around, but you
still need N/4 more to load B from the cache (using quad-word loads
that are not available in the pipelined form) plus one to control the
loop.

Of course, it goes without saying that you've already done dependence
analysis on everything, and that you are running the chip in
dual-instruction mode.  I haven't even begun to talk about the
details, so you begin to see what a "challenge" this chip is.

Another approach would be to use pattern-matching (big patterns, too)
and just call canned routines written by hand.

David Chase
Sun
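
[Illustration, not part of the original post.]  The unroll-and-jam
step described above (unroll an outer loop by three, jam it into the
innermost loop, unroll that by four, keep three inner products in
flight) can be sketched in portable C.  This is a sketch under
assumptions, not the author's code: the post does not say which loops
were unrolled, so unrolling the row loop by three and the k loop by
four is a guess, and the function name and row-major layout are
hypothetical.  It shows only the loop structure; the instruction
selection, pipelined loads, and alignment constraints that the post
says are essential cannot be expressed at this level.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for n x n row-major matrices.
 * The row loop is unrolled by 3 and jammed into the k loop, which is
 * itself unrolled by 4; s0, s1, s2 are three independent inner products. */
void matmul_unroll_jam(size_t n, const double *A, const double *B,
                       double *C)
{
    size_t i = 0;

    assert(A && B && C);
    for (; i + 3 <= n; i += 3) {
        const double *a0 = A + i * n;   /* three consecutive rows of A */
        const double *a1 = a0 + n;
        const double *a2 = a1 + n;
        for (size_t j = 0; j < n; j++) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0;
            size_t k = 0;
            for (; k + 4 <= n; k += 4) {
                /* Each B element is loaded once and used by all three sums. */
                double b0 = B[k * n + j];
                double b1 = B[(k + 1) * n + j];
                double b2 = B[(k + 2) * n + j];
                double b3 = B[(k + 3) * n + j];
                s0 += a0[k] * b0;     s1 += a1[k] * b0;     s2 += a2[k] * b0;
                s0 += a0[k + 1] * b1; s1 += a1[k + 1] * b1; s2 += a2[k + 1] * b1;
                s0 += a0[k + 2] * b2; s1 += a1[k + 2] * b2; s2 += a2[k + 2] * b2;
                s0 += a0[k + 3] * b3; s1 += a1[k + 3] * b3; s2 += a2[k + 3] * b3;
            }
            for (; k < n; k++) {        /* leftover k iterations */
                double b = B[k * n + j];
                s0 += a0[k] * b; s1 += a1[k] * b; s2 += a2[k] * b;
            }
            C[i * n + j] = s0;
            C[(i + 1) * n + j] = s1;
            C[(i + 2) * n + j] = s2;
        }
    }
    for (; i < n; i++)                  /* leftover rows, plain loop */
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}
```

Keeping three independent sums means no addition has to wait on the
result of the one just before it, which is the property the post
relies on when it talks about accumulating three inner products in
the adder pipeline.
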