Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ukma!uflorida!stat!stat.fsu.edu!mccalpin
From: mccalpin@masig3.ocean.fsu.edu (John D. McCalpin)
Newsgroups: comp.arch
Subject: i860 floating-point
Message-ID:
Date: 3 Aug 89 16:40:57 GMT
Sender: news@stat.fsu.edu
Distribution: usa
Organization: Supercomputer Computations Research Institute
Lines: 61

Well, I haven't heard anything about the i860 lately on this group,
so maybe I should start things up again....

I have been re-reading various postings about the i860 floating-point
unit and my current impression is that it is seriously bizarre.

The pipelined floating-point instructions are definitely not vector
instructions, since one instruction is required for every operation.
(Some instructions use both the adder and the multiplier, but a
separate instruction is still required for each input data pair.)

So it looks like an optimized fp code would be something like the code
below - apologies to those who know what the instruction format
_really_ is. :-)  All the src? and dest? are register addresses,
pointing to the 16 64-bit floating-point registers.

	f_mul   src1, src2, dest0	% begin loading pipe
	f_mul   src3, src4, dest0	% load second stage of pipe
	pf_mul  src5, src6, dest1	% specifies dest for 1st multiply!!!
	pf_mul  src7, src8, dest2	% specifies dest for 2nd multiply
	pf_mul  dum0, dum0, dest3	% begin unloading pipe
	pf_mul  dum0, dum0, dest4	% end unloading pipe

At this point I am finished, as I have used 8 source registers and 4
destination registers.  Three more registers are available to store
other scalars and such, but since loads are by 128-bit quantities, it
is convenient to work with an even number of elements.  (Note that
only 15 registers can be used, since fp0 is a fixed floating-point
zero.)

So I have done 4 floating-point operations in 6 cycles, and now I have
to store the results to memory and grab some new sources from memory.
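
One way to read the listing above is as a pipe in which each instruction starts a new multiply and names the destination for the multiply issued two instructions earlier. The Python sketch below models only that bookkeeping. The two-instruction lag is taken from the comments in the listing; the function name `run_pipe`, the tuple format, and the operand values are invented for illustration and say nothing about the real i860 encoding or about what the hardware does with dest0 while the pipe fills.

```python
from collections import deque

def run_pipe(instrs, lag=2):
    """Model the destination lag of the listing.

    instrs: list of (a, b, dest) tuples, one per issued multiply.
    lag:    number of instructions after which a product is written
            back (2 in the listing; an assumption of this sketch).
    Returns a dict mapping dest name -> product written to it.
    """
    in_flight = deque()  # products still inside the pipe
    regs = {}
    for a, b, dest in instrs:
        if len(in_flight) == lag:
            # The oldest product leaves the pipe and lands in the
            # destination named by the *current* instruction.
            regs[dest] = in_flight.popleft()
        # While the pipe is filling, this model ignores dest.
        in_flight.append(a * b)
    return regs

program = [
    (1.0, 2.0, "dest0"),  # begin loading pipe
    (3.0, 4.0, "dest0"),  # load second stage of pipe
    (5.0, 6.0, "dest1"),  # dest1 receives 1.0 * 2.0
    (7.0, 8.0, "dest2"),  # dest2 receives 3.0 * 4.0
    (0.0, 0.0, "dest3"),  # dummy operands; dest3 receives 5.0 * 6.0
    (0.0, 0.0, "dest4"),  # dummy operands; dest4 receives 7.0 * 8.0
]
result = run_pipe(program)
print(result)
```

In this model, six issued instructions deliver four products, which matches the "4 operations in 6 cycles" count if each instruction issues in one cycle.
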
I can load/store in pipelined mode, too, with two 64-bit operands
transferred every cycle, with a 3-cycle pipeline length.  So I would
need two store instructions (4 cycles), and four load instructions (6
cycles) to re-fill the registers that I used.

Overall, then, it takes something like 16 cycles (6 + 4 + 6) to
deliver 4 64-bit floating-point operations, assuming no data or
instruction cache misses.

Getting this level of performance requires unrolling the loop to a
depth of 4, even for this simple operation.  More complicated loops
may not be unrollable to even this depth, since only 16 registers are
available.

Lots of questions:

(1) Am I missing something obvious?
(2) Can more things be overlapped than this?
(3) The programmer's manual refers to instructions that use both the
    adder and the multiplier, but most of these look like accumulate
    functions (e.g. dot product).  Is there a "linked triad"
    instruction which takes 3 operands and does x = y + q*z, where 4
    registers are used?

Comments welcome....
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
                   mccalpin@delocn.udel.edu
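
The cycle tally in the post can be reproduced with a small Python sketch. It assumes, as the post's own counts imply, that n back-to-back pipelined instructions take n cycles plus a fixed fill/drain overhead of 2, and that the multiply, store, and load phases do not overlap. The helper name `phase_cycles` is made up for this illustration.

```python
def phase_cycles(n_instr, overhead=2):
    """Cycles for n back-to-back pipelined instructions, assuming a
    fixed fill/drain overhead (2 matches the counts in the post)."""
    return n_instr + overhead

flops = 4                   # four 64-bit multiplies per pass
mul = phase_cycles(4)       # 4 results through the multiplier pipe
store = phase_cycles(2)     # 2 stores of 128 bits each (4 results)
load = phase_cycles(4)      # 4 loads of 128 bits each (8 sources)
total = mul + store + load  # assumes no overlap between phases

print(mul, store, load, total)
print(flops / total)        # floating-point results per cycle
```

Under these assumptions the loop sustains one floating-point result every four cycles. Any overlap of loads and stores with the multiplies, as asked in question (2), would raise that figure.
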