Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!mailrus!cornell!uw-beaver!rice!titan!preston
From: preston@titan.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: i860 floating-point
Message-ID: <430@brazos.Rice.edu>
Date: 4 Aug 89 22:29:07 GMT
References: <MCCALPIN.89Aug3124057@masig3.ocean.fsu.edu>
Sender: root@rice.edu
Reply-To: preston@titan.rice.edu (Preston Briggs)
Distribution: usa
Organization: Rice University, Houston
Lines: 111

In article <MCCALPIN.89Aug3124057@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>The pipelined floating-point instructions are definitely not vector
>instructions, since one instruction is required for every operation.

    However, vectorizing compiler technology can be readily adapted
    (and in fact has been) to generate good code for the i860.  Such
    compilers treat the on-chip cache as vector registers of reasonable
    length, and use tight, highly pipelined loops to operate on these
    "registers."  The scheme is actually more flexible than using "real"
    vector registers, since the loops can do arbitrarily complex
    operations, not just simple adds or multiplies.

>So it looks like a optimized fp code would be something the code below -
>All the src? and dest? are register addresses, pointing to the 16 64-bit
>floating-point registers.

    Actually, there are 32 FP registers of 32 bits each.  f0 and f1 are
    reserved.  The registers can be treated as 16 64-bit registers if desired.

>	f_mul	src1, scr2, dest0	% begin loading pipe
>	f_mul	src3, src4, dest0	% load second stage of pipe
>	pf_mul	src5, src6, dest1	% specifies dest for 1st multiply!!!
>	pf_mul	src7, src8, dest2	% specifies dest for 2nd multiply
>	pf_mul	dum0, dum0, dest3	% begin unloading pipe
>	pf_mul	dum0, dum0, dest4	% end unloading pipe

    You're close.  There's actually another stage in the pipeline.
    A more accurate version is
	pfmul.ss	f2,f3,f0	start f2*f3 (single precision)
	pfmul.ss	f4,f5,f0	dump garbage into f0
	pfmul.ss	f6,f7,f0
	pfmul.ss	f8,f9,f2	f2 gets results of f2*f3
	...
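
    The destination rule in that sequence can be sketched with a toy
    model.  This is only an illustration, not a description of the real
    hardware: the three-stage depth is inferred from the four-instruction
    example above, and the names run_pfmul, PIPE_DEPTH and RESERVED are
    made up for the sketch.  Writes to f0 and f1 are assumed to be
    discarded, and None stands for whatever garbage is in the pipe at
    the start.

```python
from collections import deque

PIPE_DEPTH = 3            # assumption: three stages, per the example above
RESERVED = ("f0", "f1")   # assumption: writes to these are thrown away


def run_pfmul(program, regs):
    """Run a list of (src1, src2, dest) pipelined multiplies.

    Each instruction pushes a new product into the pipe and writes the
    product that falls out of the last stage into its own dest.
    """
    pipe = deque([None] * PIPE_DEPTH)   # None = unknown old contents
    for src1, src2, dest in program:
        entering = regs[src1] * regs[src2]   # sources are read first
        leaving = pipe.popleft()             # result started 3 instructions ago
        pipe.append(entering)
        if dest not in RESERVED:
            regs[dest] = leaving
    return regs


if __name__ == "__main__":
    regs = {f"f{i}": float(i) for i in range(32)}
    regs["f0"] = regs["f1"] = 0.0
    program = [
        ("f2", "f3", "f0"),   # start f2*f3
        ("f4", "f5", "f0"),   # garbage goes to f0
        ("f6", "f7", "f0"),
        ("f8", "f9", "f2"),   # f2 receives the product started first
    ]
    run_pfmul(program, regs)
    print(regs["f2"])   # 2.0 * 3.0 from the first instruction
```

    In this model the fourth instruction names the destination of the
    first multiply, which is the point of the corrected sequence.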

>At this point I am finished, as I have used 8 source registers and 4
>destination registers.  Three more registers are available to store
>other scalars and such, but since loads are by 128-bit quantities, it
>is convenient to work with an even number of elements. (Note that only
>15 registers can be used, since fp0 is a fixed floating-point zero.)

>So I have done 4 floating-point operations in 6 cycles, and now I have
>to store the results to memory and grab some new sources from memory.
>I can load/store in pipelined mode, too, with two 64-bit operands
>transferred every cycle, with a 3-cycle pipeline length.  So I would
>need two store instructions (4 cycles), and four load instructions (6
>cycles) to re-fill the registers that I used.

>Overall, then, it takes something like 16 cycles to deliver 4 64-bit
>floating-point operations, assuming no data or instruction cache misses.
>To get this level of performance requires loop unrolling to a depth of
>4 for this simple operation.  More complicated loops may not be
>unrollable to even this depth, since only 16 registers are available.

>Lots of questions:
>(1) Am I missing something obvious?
>(2) Can more things be overlapped than this?
>(3) The programmers manual refers to instructions that use both the
>    adder and the multiplier, but most of these look like accumulate
>    functions (e.g. dot product).  Is there a "linked triad" instruction
>    which takes 3 operands and does x = y + q*z, where 4 registers are
>    used?
>John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
>		   mccalpin@delocn.udel.edu


Basically, more stuff can be overlapped.

1st, there is the "dual instruction mode".
This allows you to simultaneously execute an integer ("core") instruction and
a floating-point instruction.  Core instructions include loading and
storing FP registers.  So, we can overlap loading and storing
with actual arithmetic.  The i860 also has an auto-increment addressing
mode for zapping quickly through vectors.

2nd, there are the "dual operation instructions".
These are the ones that allow simultaneous addition and
multiplication.  Unfortunately, it isn't possible to be completely
general when specifying the sources and destinations of the adder
and multiplier.  However, it is possible to use both the
dual instruction mode and the dual operation instructions
at the same time.

3rd (and so on), you can get more use out of the registers
than you have suggested.  After the 1st instruction (in my example),
it would be possible to reuse registers f2 and f3 immediately.

It's also possible to use techniques like "software pipelining"
or "perfect pipelining" to help keep the pipelines full during loops.
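
As a rough sketch of what software pipelining buys, here is a toy
model of an elementwise multiply loop run through a three-stage pipe.
Everything in it is an assumption made for illustration: the depth of
three, the one-issue-slot-per-instruction cost model, and the names
pipelined_multiply and unoverlapped_slots.  It ignores loads, stores
and dual instruction mode entirely.

```python
from collections import deque

PIPE_DEPTH = 3  # assumption: three-stage multiplier pipe


def unoverlapped_slots(n):
    """Issue slots if each product is pushed all the way through the
    pipe before the next one is started."""
    return n * (PIPE_DEPTH + 1)


def pipelined_multiply(a, b):
    """Multiply a[i] * b[i] with iterations overlapped in the pipe.

    Returns (results, issue_slots).  The first PIPE_DEPTH issues are
    the prologue (nothing useful comes out yet), the last PIPE_DEPTH
    issues are the epilogue (dummy operations that flush the pipe).
    """
    n = len(a)
    if n == 0:
        return [], 0
    pipe = deque([None] * PIPE_DEPTH)
    results = []
    slots = 0
    for i in range(n + PIPE_DEPTH):
        entering = a[i] * b[i] if i < n else None   # None = dummy flush op
        leaving = pipe.popleft()
        pipe.append(entering)
        slots += 1
        if i >= PIPE_DEPTH:
            results.append(leaving)   # product of element i - PIPE_DEPTH
    return results, slots


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    b = [2.0] * 8
    results, slots = pipelined_multiply(a, b)
    print(results)
    print("overlapped slots:  ", slots)
    print("unoverlapped slots:", unoverlapped_slots(len(a)))
```

Under these assumptions the overlapped loop costs n + 3 issue slots
against 4n for the unoverlapped one, so the per-element cost tends
toward one slot as the vector gets longer.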

For matrix multiply, it looks possible to achieve a rate of 1 FLOP/cycle
without using the dual-operation instructions.  Using
the dual-operation instruction, it should be possible to approach
5/3 FLOP/cycle.  This includes memory latency, and is perhaps
achievable using automatic techniques.

All this will be fairly difficult for a simple compiler to use.
However, by using vectorizing front-ends, compiler writers can take
advantage of hand-built subroutines implementing many
common operations.  In addition, a really aggressive
compiler will have lots of opportunities for optimization.

I think that a fairly complex chip like the i860 will
accentuate the difference between the best dependence-based
optimizing compilers and the comparatively simple compilers
common today.  Instead of a 30% improvement, or even a factor
of 2 improvement, over a PCC-like compiler, we'll see a 
fairly sizeable integer factor.

Optimistically yours,
Preston Briggs