Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!oliveb!Ozona!chase
From: chase@Ozona.orc.olivetti.com (David Chase)
Newsgroups: comp.arch
Subject: Re: i860 floating-point
Message-ID: <45980@oliveb.olivetti.com>
Date: 4 Aug 89 01:12:59 GMT
References: <MCCALPIN.89Aug3124057@masig3.ocean.fsu.edu>
Sender: news@oliveb.olivetti.com
Reply-To: chase@Ozona.UUCP (David Chase)
Distribution: usa
Organization: Olivetti Research Center, Menlo Park, CA
Lines: 158

In article <MCCALPIN.89Aug3124057@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>Well, I haven't heard anything about the i860 lately on this group, so
>maybe I should start things up again....

>I have been re-reading various postings about the i860 floating-point unit
>and my current impression is that it is seriously bizarre.

>Lots of questions:

>(1) Am I missing something obvious?
>(2) Can more things be overlapped than this?
>(3) The programmers manual refers to instructions that use both the
>    adder and the multiplier, but most of these look like accumulate
>    functions (e.g. dot product).  Is there a "linked triad" instruction
>    which takes 3 operands and does x = y + q*z, where 4 registers are
>    used?

(1) I would say, no, you are not missing anything obvious.
(2) Yes, quite a bit.
(3) Sort of.

I bet you'd like an explanation of that, wouldn't you?  The best way
to make the chip scream is to get it into dual-instruction mode,
issuing one dual-operation floating point instruction and one core
instruction at each as clock tick.  This is how you get the 120 MOPS
figure for a 40MHz chip (I have no idea what clock rate is actually
being shipped, if any, but 40 is a nice round number).  Anyhow, to do
this, and actually be doing something useful, you have to (1) come up
with some computation that can actually make use of the many strange
ways that the adder and multiplier pipes can be hooked up and (2)
figure out how to amortize the load-store bandwidth of one floating
point instruction onto ONE core unit instruction.

There are three problems that illustrate this, sort of (assume
row-major storage, i.e., not Fortran, in the examples that follow)

(1) row-row dot product
    inner loop of A * B-transpose
    (This example appears in the i860 PRM in several forms)

(2) row-column dot product
    inner loop of A * B

(3) elementary row operations
    Gaussian elimination, among other things

For a row-row dot product, the inner loop looks like (in LaTeX):

d.ml2apm.ss  f4, f12, f0 & fld.q  16(r29)++, f8     & Load $A$ \\
d.ml2apm.ss  f5, f13, f0 & pfld.d 8(r24)++, f16     & Load $B$ \\
d.ml2apm.ss  f6, f14, f0 & pfld.d 8(r24)++, f18     & Load $B$ \\
d.ml2apm.ss  f7, f15, f0 & fld.q  16(r29)++, f4     & Load $A$ \\
d.ml2apm.ss  f8, f16, f0 & nop                      &          \\
d.ml2apm.ss  f9, f17, f0 & pfld.d 8(r24)++, f12     & Load $B$ \\
d.ml2apm.ss f10, f18, f0 & bla r27,r28, inner\_loop &          \\
d.ml2apm.ss f11, f19, f0 & pfld.d 8(r24)++, f14     & Load $B$

The idea is that the row of matrix A is stored in the cache, but the
row of matrix B is not.  The loop is unrolled 8 times in order to get
the required separation (for top speed) between the instructions which
load the registers and the instructions which use them.  The 8
floating point instructions require 16 input operands, and these must
somehow be loaded in 7 core unit instructions.  This is achieved by
using quad-word loads from cache (8/4 = 2) and double-word pipeline
loads (8/2 = 4).  I have omitted the gory details of start-up code,
and bothering to detail which load puts the results where, when.

One detail that should not be omitted is the floating point
instruction used here.  "ml2apm.ss" advances the (3-stage) multiplier
pipeline one step and advances the (3-stage) adder pipeline one step.
The inputs to the multiplier are the two register operands; the inputs
to the adder are the output of the multiplier and the output of the
adder.  In all the instructions above the output register is f0,
meaning "discard".  (Those of you who care about re-association of
your floating point arithmetic should be a little worried by now.)

Anyhow, the chip runs at full speed once the row of A is in the cache.
Since only that row is cached, it can be pretty large.


(2) is much more subtle.  The problem is that however you unroll that
loop, the elements of the column of B are not stored consecutively,
and thus must be loaded one-at-a-time.  For an unrolling by N, this
means that N (load B) + N/4 (load A) + 1 (branch) core instructions
are required for N floating point instructions.  That is, if N = 8
then the chip can only operate at 8/11 (72%) of potential MFlops.

However, matrix multiplication (of the non-Strassen sort) is
O(N-cubed), and matrix transposition is only O(N-squared).
Asymptotically, you can win that way.

However however, there is a way to do it in place.  Observe that so
far we have been multiplying one row of A by one column of B, and we
are choking on the instructions to load B.  Obviously, all we need to
do is reuse those loads a few times.  We can do this by multiplying 3
rows of A by one column of B.  Why 3?  The adder has three stages, and
when running as an accumulator we can treat those as registers.  It is
claimed that the following code will do the trick:

d.ml2apm.ss f6,f16,f0  & nop                       & \\
d.ml2apm.ss f10,f16,f0 & nop                       & \\
d.ml2apm.ss f14,f16,f0 & pfld.s r20(r19)++, f18    & \\
\\
d.ml2apm.ss f7,f17,f0  & 16(r16)++, fld.q f4       & \\
d.ml2apm.ss f11,f17,f0 & 16(r17)++, fld.q f8       & \\
d.ml2apm.ss f15,f17,f0 & 16(r18)++, fld.q f12      & \\
\\
d.ml2apm.ss f4,f18,f0  & pfld.s r20(r19)++, f19    & \\
d.ml2apm.ss f8,f18,f0  & nop                       & \\
d.ml2apm.ss f12,f18,f0 & pfld.s r20(r19)++, f16    & \\
\\
d.ml2apm.ss f5,f19,f0  & nop                       & \\
d.ml2apm.ss f9,f19,f0  & bla r21, r22, inner\_loop & \\
d.ml2apm.ss f13,f19,f0 & pfld.s r20(r19)++, f17    & \\

In this case, the floating point unit is actually computing three
inner products simultaneously.  Doing this also means that the
additions in the inner product are not reordered, which is a small win.

This might help a little.  Each iteration of the loop will multiply
these registers, loaded from A

   f6 &  f7 &  f4 &  f5
  f10 & f11 &  f8 &  f9
  f14 & f15 & f12 & f13

by these registers, loaded from B

  f16
  f17
  f18
  f19

updating three quantities circulating around in the adder pipeline.
At the end of the loop, the pipelines look like this:

|--------Multiplier----------|-------------Adder----------------|

                                 +------------------------------+
                                 V                              |
f13*f19 -> f9*f19 -> f5*f19 ---> f12*f18 --> f8*f18 --> f4*f18 -+-> f0
                                +f15*f17    +f11*f17   +f7*f17
                                +f14*f16    +f10*f16   +f6*f16
                                +etc        +etc       +etc

Obviously, this is not obvious, and all the edge conditions are enough
to make you tear your hair.  Also note that those delightful load
operations prefer that their operands be aligned (at least, that's how
I understood it).  Anyhow, once you cross all the i's and dot all the
t's, you should be able to multiply matrices in the "usual way" at
full speed, i.e. 80 MFlops for a 40Mhz chip.

I'll post more, later, if people are interested.  Dinner and a movie
are waiting.  Problem (3) is not so easy, because of bandwidth
constraints (more operands), but I haven't tried to rearrange the
algorithms yet.

David