Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ukma!uflorida!stat!stat.fsu.edu!mccalpin
From: mccalpin@masig3.ocean.fsu.edu (John D. McCalpin)
Newsgroups: comp.arch
Subject: i860 floating-point
Message-ID: <MCCALPIN.89Aug3124057@masig3.ocean.fsu.edu>
Date: 3 Aug 89 16:40:57 GMT
Sender: news@stat.fsu.edu
Distribution: usa
Organization: Supercomputer Computations Research Institute
Lines: 61

Well, I haven't heard anything about the i860 lately on this group, so
maybe I should start things up again....

I have been re-reading various postings about the i860 floating-point unit
and my current impression is that it is seriously bizarre.

The pipelined floating-point instructions are definitely not vector
instructions, since one instruction is required for every operation.
(Some instructions use both the adder and the multiplier, but it still
requires a separate instruction for each input data pair).

So it looks like optimized fp code would be something like the code below -
apologies to those who know what the instruction format _really_ is. :-)
All the src? and dest? are register addresses, pointing to the 16 64-bit
floating-point registers.

	f_mul	src1, src2, dest0	% begin loading pipe
	f_mul	src3, src4, dest0	% load second stage of pipe
	pf_mul	src5, src6, dest1	% specifies dest for 1st multiply!!!
	pf_mul	src7, src8, dest2	% specifies dest for 2nd multiply
	pf_mul	dum0, dum0, dest3	% begin unloading pipe
	pf_mul	dum0, dum0, dest4	% end unloading pipe
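
The delayed-destination behaviour in the listing above can be sketched as a
small simulation. This is only a toy model of the pipeline as described in
this post (a result lands in the destination named two instructions after
the multiply was issued); the function name and the two-instruction delay
are assumptions for illustration, not actual i860 semantics.

```python
from collections import deque


def run_pipelined_multiplies(instructions, delay=2):
    """Toy model: each instruction is (a, b, dest).

    The product a*b enters the pipe, and the dest named by this instruction
    receives the product that entered `delay` instructions earlier.
    """
    pipe = deque([None] * delay)   # None marks a stage with nothing useful in it
    registers = {}
    for a, b, dest in instructions:
        finished = pipe.popleft()  # oldest product leaves the pipe
        if finished is not None:
            registers[dest] = finished
        pipe.append(a * b)         # new product enters the pipe
    return registers


# Same shape as the listing: two loading steps, two steady-state steps,
# and two dummy multiplies to unload the pipe.
example = [
    (2.0, 3.0, "dest0"),
    (4.0, 5.0, "dest0"),
    (6.0, 7.0, "dest1"),
    (8.0, 9.0, "dest2"),
    (0.0, 0.0, "dest3"),
    (0.0, 0.0, "dest4"),
]
print(run_pipelined_multiplies(example))
```

In this model dest1 ends up holding the product from the first instruction,
and the two dummy multiplies exist only to push the last two real products
out of the pipe.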

At this point I am finished, as I have used 8 source registers and 4
destination registers.  Three more registers are available to store
other scalars and such, but since loads are by 128-bit quantities, it
is convenient to work with an even number of elements. (Note that only
15 registers can be used, since fp0 is a fixed floating-point zero.)
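
The register budget above can be written out as a short calculation. This is
a sketch under the assumptions of this post only: 15 usable 64-bit registers,
two sources and one destination per multiply, and an even unrolling depth to
match the 128-bit loads. The function names are hypothetical.

```python
USABLE_REGISTERS = 15  # 16 registers minus fp0, the fixed zero


def registers_needed(unroll_depth):
    """Two source registers and one destination register per multiply."""
    return 2 * unroll_depth + unroll_depth


def max_even_unroll(usable=USABLE_REGISTERS):
    """Largest even unrolling depth whose operands fit in the usable registers."""
    depth = usable // 3
    return depth - depth % 2


depth = max_even_unroll()
print(depth, registers_needed(depth), USABLE_REGISTERS - registers_needed(depth))
```

With these assumptions the depth comes out as 4, using 12 registers and
leaving 3 spare, which matches the count in the paragraph above.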

So I have done 4 floating-point operations in 6 cycles, and now I have
to store the results to memory and grab some new sources from memory.
I can load/store in pipelined mode, too, with two 64-bit operands
transferred every cycle, with a 3-cycle pipeline length.  So I would
need two store instructions (4 cycles), and four load instructions (6
cycles) to re-fill the registers that I used.

Overall, then, it takes something like 16 cycles to deliver 4 64-bit
floating-point operations, assuming no data or instruction cache misses.
To get this level of performance requires loop unrolling to a depth of
4 for this simple operation.  More complicated loops may not be
unrollable to even this depth, since only 16 registers (15 usable) are
available.
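
The cycle estimate above can be reproduced with a few lines of arithmetic.
This is a sketch that takes the counts in this post at face value: each
pipelined sequence costs its instruction count plus 2 cycles of fill/drain,
each load or store moves two 64-bit operands, and nothing is overlapped.
The names and the overhead constant are assumptions for illustration.

```python
PIPE_OVERHEAD = 2  # extra cycles per pipelined sequence, as counted in the post


def pipelined_cycles(n_instructions, overhead=PIPE_OVERHEAD):
    """Cycles for a run of pipelined instructions with no overlap."""
    return n_instructions + overhead


def block_cycles(unroll_depth):
    """Cycles for one unrolled block: multiplies, then stores, then loads."""
    multiplies = unroll_depth
    stores = unroll_depth // 2        # two 64-bit results per store
    loads = (2 * unroll_depth) // 2   # two 64-bit sources per load
    return (pipelined_cycles(multiplies)
            + pipelined_cycles(stores)
            + pipelined_cycles(loads))


def flops_per_cycle(unroll_depth):
    return unroll_depth / block_cycles(unroll_depth)


print(block_cycles(4), flops_per_cycle(4))
```

For a depth of 4 this gives 6 + 4 + 6 = 16 cycles, or 0.25 floating-point
operations per cycle. Any overlap of loads and stores with the multiplies
(question 2 below) would reduce the total.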

Lots of questions:

(1) Am I missing something obvious?

(2) Can more things be overlapped than this?

(3) The programmer's manual refers to instructions that use both the
    adder and the multiplier, but most of these look like accumulate
    functions (e.g. dot product).  Is there a "linked triad" instruction
    which takes 3 source operands and does x = y + q*z, so that 4
    registers are named (3 sources plus the destination)?
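
The difference between the two operation shapes in question (3) can be
sketched in plain code. This only illustrates the arithmetic; the function
names are hypothetical and nothing here says which forms the i860 actually
provides.

```python
def dot_product(y, z):
    """Accumulate form: each product is added back into one running sum."""
    acc = 0.0
    for yi, zi in zip(y, z):
        acc = acc + yi * zi  # the adder's result feeds back as its own input
    return acc


def linked_triad(y, q, z):
    """Triad form: x[i] = y[i] + q * z[i], one independent result per element."""
    return [yi + q * zi for yi, zi in zip(y, z)]


print(dot_product([1.0, 2.0], [3.0, 4.0]))
print(linked_triad([1.0, 2.0], 2.0, [3.0, 4.0]))
```

The accumulate form names only the two inputs and one running sum, while
each triad element names four values (x, y, q and z), which is where the
count of 4 registers in the question comes from.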

Comments welcome....


--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
		   mccalpin@delocn.udel.edu