Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!csd4.milw.wisc.edu!bionet!apple!oliveb!Ozona!chase
From: chase@Ozona.orc.olivetti.com (David Chase)
Newsgroups: comp.arch
Subject: Re: i860 floating-point
Message-ID: <46044@oliveb.olivetti.com>
Date: 4 Aug 89 22:30:45 GMT
References: <45980@oliveb.olivetti.com>
Sender: news@oliveb.olivetti.com
Reply-To: chase@Ozona.UUCP (David Chase)
Distribution: usa
Organization: Olivetti Research Center, Menlo Park, CA
Lines: 69

Continuing a previous follow-up:

Note -- the instruction written as "ml2apm" in the previous posting is
probably really called "m12apm".  If you're reading this in a true
typewriter font, you won't see the difference.

>> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>> Well, I haven't heard anything about the i860 lately on this group, so
>> maybe I should start things up again....

>>(3) The programmers manual refers to instructions that use both the
>>    adder and the multiplier, but most of these look like accumulate
>>    functions (e.g. dot product).  Is there a "linked triad" instruction
>>    which takes 3 operands and does x = y + q*z, where 4 registers are
>>    used?

>There are three problems that illustrate this, sort of (assume
>row-major storage, i.e., not Fortran, in the examples that follow)

[1 & 2 discussed in previous post]

(3) elementary row operations
    Gaussian elimination, among other things

Elementary row operations are not as easy to get going fast as inner
product operations, because each instruction has (as written) two
input operands and one output operand.  (Actually, there are three
input operands, but we're about to answer John McCalpin's question
above).  Typically, also, it will happen that one of the input
operands will be cached, but the other input operand and the output
operand will not be cached (i.e., will be loaded and stored by the
pipelined load/store operations).
The difficulty with this is that pipelined load/store operations have
a maximum operand size of 64 bits (consider the width of the bus to
memory, and you'll see why), that is, only two single-precision
operands.  Working in this style with a loop unrolled 8 times will
give at best 8/11 of peak floating-point speed.

A digression -- typically, the repeated part of a row operation is of
the form:

    Row1[k] := Row1[k] + factor * Row2[k]

for increasing k.  That is, there are four operands, but one of them
(factor) doesn't change.  The i860 FPU accommodates this by having a
number of register-like things in the FPU named KR, KI, and T.  KR and
KI can act as inputs to the multiplier, and T can buffer a result from
the multiplier before it is fed to the adder.  That is, m12apm feeds
the result of the multiplier directly to the adder (with the result of
the adder used for the other adder input), while m12ttpa inserts T
between the adder and the multiplier.  All in all, I count 32
different ways to hook up the multiplier and the adder.  In addition,
the adder can also subtract, making for a grand total of 64 different
dual-operation pipelined FPU instructions.  For the example above,
either r2p1 or i2p1 might do the trick.  (End digression)

Anyhow, row operations don't go at full speed if the updated row is
stored in memory and the operands are single precision.  Things get
even slower in double precision, since even more instructions are
necessary to load the operands.

I'm still thinking about how this might be sped up.  It might be
possible to double- or triple-up the row operations (they're usually
part of a larger algorithm, like QR factorization or Gaussian
elimination), but I haven't thought it through yet.

David
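
For concreteness, here is a rough C sketch of the row operation from
the digression above, once as a plain loop and once unrolled 8 times.
The function names (row_axpy, row_axpy_unrolled8) are made up for
illustration and do not come from the posting or the i860 manual.  The
code shows only the arithmetic and the unrolling; it makes no attempt
to model the dual-operation pipelined instructions or the pipelined
loads and stores.

```c
#include <assert.h>
#include <stddef.h>

/* Row1[k] := Row1[k] + factor * Row2[k], for k = 0 .. n-1.
 * "factor" is the one operand that does not change across the loop. */
void row_axpy(float *row1, const float *row2, float factor, size_t n)
{
    assert(row1 != NULL && row2 != NULL);
    for (size_t k = 0; k < n; k++)
        row1[k] = row1[k] + factor * row2[k];
}

/* The same operation with the loop body unrolled 8 times.
 * Illustrative only: how a compiler or assembly programmer would
 * schedule this on real hardware is not shown here. */
void row_axpy_unrolled8(float *row1, const float *row2, float factor,
                        size_t n)
{
    size_t k = 0;

    assert(row1 != NULL && row2 != NULL);
    for (; k + 8 <= n; k += 8) {
        row1[k + 0] += factor * row2[k + 0];
        row1[k + 1] += factor * row2[k + 1];
        row1[k + 2] += factor * row2[k + 2];
        row1[k + 3] += factor * row2[k + 3];
        row1[k + 4] += factor * row2[k + 4];
        row1[k + 5] += factor * row2[k + 5];
        row1[k + 6] += factor * row2[k + 6];
        row1[k + 7] += factor * row2[k + 7];
    }
    /* Leftover elements when n is not a multiple of 8. */
    for (; k < n; k++)
        row1[k] += factor * row2[k];
}
```

Each unrolled iteration touches 8 elements of each row, so with
single-precision data and 64-bit transfers that is at least four
loads and four stores of the uncached row per iteration.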