Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!csd4.milw.wisc.edu!bionet!apple!oliveb!Ozona!chase
From: chase@Ozona.orc.olivetti.com (David Chase)
Newsgroups: comp.arch
Subject: Re: i860 floating-point
Message-ID: <46044@oliveb.olivetti.com>
Date: 4 Aug 89 22:30:45 GMT
References: <MCCALPIN.89Aug3124057@masig3.ocean.fsu.edu> <45980@oliveb.olivetti.com>
Sender: news@oliveb.olivetti.com
Reply-To: chase@Ozona.UUCP (David Chase)
Distribution: usa
Organization: Olivetti Research Center, Menlo Park, CA
Lines: 69

Continuing a previous follow-up:

Note -- the instruction written as "ml2apm" in the previous posting is
probably really called "m12apm".  If you're reading this in a true
typewriter font, you won't see the difference.

>> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>> Well, I haven't heard anything about the i860 lately on this group, so
>> maybe I should start things up again....

>>(3) The programmers manual refers to instructions that use both the
>>    adder and the multiplier, but most of these look like accumulate
>>    functions (e.g. dot product).  Is there a "linked triad" instruction
>>    which takes 3 operands and does x = y + q*z, where 4 registers are
>>    used?

>There are three problems that illustrate this, sort of (assume
>row-major storage, i.e., not Fortran, in the examples that follow)

[1 & 2 discussed in previous post]

(3) elementary row operations
    Gaussian elimination, among other things

Elementary row operations are not as easy to get going fast as inner
product operations, because each instruction has (as written) two
input operands and one output operand.  (Actually, there are three
input operands, but we're about to answer John McCalpin's question
above).  Typically, also, it will happen that one of the input
operands will be cached, but the other input operand and the output
operand will not be cached (i.e., will be loaded and stored by the
pipelined load/store operations).  The difficulty with this is that
pipelined load/store operations have a maximum operand size of 64 bits
(consider the width of the bus to memory, and you'll see why), that
is, only two single-precision operands per load or store.  Working in
this style with a loop unrolled 8 times will give at best 8/11 of peak
floating-point speed.

A digression -- typically, the repeated part of a row operation is of
the form:

     Row1[k] := Row1[k] + factor * Row2[k]

for increasing k.

That is, there are four operands, but one of them doesn't change.  The
i860 FPU accommodates this by having a number of register-like things
in the FPU named KR, KI, and T.  KR and KI can act as inputs to the
multiplier, and T can buffer a result from the multiplier before it is
fed to the adder.  That is, m12apm feeds the result of the multiplier
directly to the adder (with the result of the adder used for the other
adder input), while m12ttpa inserts T between the adder and the
multiplier.  All in all, I count 32 different ways to hook up the
multiplier and the adder.  In addition, the adder can also subtract,
making for a grand total of 64 different dual-operation pipelined FPU
instructions.  For the example above, either r2p1 or i2p1 might do the
trick.
(End digression)
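
For reference, the row operation from the digression can be written as
a plain C loop.  This is only a sketch: the function name row_axpy and
its parameter names are made up for illustration, and it says nothing
about how a compiler would map the loop onto the dual-operation
instructions.  The point is that "factor" is loop-invariant, which is
the role a register like KR or KI could play.

```c
#include <stddef.h>

/* Row1[k] := Row1[k] + factor * Row2[k], for k = 0 .. n-1.
 * row_axpy is a hypothetical name chosen for this sketch. */
void row_axpy(float *row1, const float *row2, float factor, size_t n)
{
    size_t k;

    for (k = 0; k < n; k++) {
        /* Four operands per step: row1[k] is read and written,
         * row2[k] is read, and factor never changes. */
        row1[k] = row1[k] + factor * row2[k];
    }
}
```

Each trip through the loop does one multiply and one add but touches
memory three times (two loads and one store), which is why the
load/store limits matter so much here.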

Anyhow, row operations don't go at full speed if the updated row is
stored in memory and the operands are single precision.  Things get
even slower in double precision, since even more instructions are
necessary to load the operands.

I'm still thinking about how this might be sped up.  It might be
possible to double- or triple-up the row operations (they're usually
part of a larger algorithm, like QR factorization or Gaussian
elimination), but I haven't thought it through yet.
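
One possible reading of "doubling up", sketched in C below, is to apply
two row operations to the same target row in a single pass, so that
each element of the updated row is loaded and stored once instead of
twice.  This is an assumption about what such a scheme might look like,
not something worked out or measured here; the name row_axpy2 and its
parameters are invented for the sketch.

```c
#include <stddef.h>

/* Row1[k] := Row1[k] + f2 * Row2[k] + f3 * Row3[k], for k = 0 .. n-1.
 * row_axpy2 is a hypothetical name; this fuses two row operations
 * that would otherwise each make their own pass over row1. */
void row_axpy2(float *row1, const float *row2, const float *row3,
               float f2, float f3, size_t n)
{
    size_t k;

    for (k = 0; k < n; k++) {
        /* Two multiplies and two adds per element of row1, with
         * one load and one store of row1[k]. */
        row1[k] = row1[k] + f2 * row2[k] + f3 * row3[k];
    }
}
```

Whether this helps depends on whether the surrounding algorithm can
supply two source rows and two factors at once.  Note also that fusing
the two updates may round differently from doing them one after the
other.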

David