Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!apple!oliveb!Ozona!chase From: chase@Ozona.orc.olivetti.com (David Chase) Newsgroups: comp.arch Subject: Re: i860 floating-point Message-ID: <45980@oliveb.olivetti.com> Date: 4 Aug 89 01:12:59 GMT References: Sender: news@oliveb.olivetti.com Reply-To: chase@Ozona.UUCP (David Chase) Distribution: usa Organization: Olivetti Research Center, Menlo Park, CA Lines: 158 In article mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes: >Well, I haven't heard anything about the i860 lately on this group, so >maybe I should start things up again.... >I have been re-reading various postings about the i860 floating-point unit >and my current impression is that it is seriously bizarre. >Lots of questions: >(1) Am I missing something obvious? >(2) Can more things be overlapped than this? >(3) The programmers manual refers to instructions that use both the > adder and the multiplier, but most of these look like accumulate > functions (e.g. dot product). Is there a "linked triad" instruction > which takes 3 operands and does x = y + q*z, where 4 registers are > used? (1) I would say, no, you are not missing anything obvious. (2) Yes, quite a bit. (3) Sort of. I bet you'd like an explanation of that, wouldn't you? The best way to make the chip scream is to get it into dual-instruction mode, issuing one dual-operation floating point instruction and one core instruction at each as clock tick. This is how you get the 120 MOPS figure for a 40MHz chip (I have no idea what clock rate is actually being shipped, if any, but 40 is a nice round number). Anyhow, to do this, and actually be doing something useful, you have to (1) come up with some computation that can actually make use of the many strange ways that the adder and multiplier pipes can be hooked up and (2) figure out how to amortize the load-store bandwidth of one floating point instruction onto ONE core unit instruction. There are three problems that illustrate this, sort of (assume row-major storage, i.e., not Fortran, in the examples that follow) (1) row-row dot product inner loop of A * B-transpose (This example appears in the i860 PRM in several forms) (2) row-column dot product inner loop of A * B (3) elementary row operations Gaussian elimination, among other things For a row-row dot product, the inner loop looks like (in LaTeX): d.ml2apm.ss f4, f12, f0 & fld.q 16(r29)++, f8 & Load $A$ \\ d.ml2apm.ss f5, f13, f0 & pfld.d 8(r24)++, f16 & Load $B$ \\ d.ml2apm.ss f6, f14, f0 & pfld.d 8(r24)++, f18 & Load $B$ \\ d.ml2apm.ss f7, f15, f0 & fld.q 16(r29)++, f4 & Load $A$ \\ d.ml2apm.ss f8, f16, f0 & nop & \\ d.ml2apm.ss f9, f17, f0 & pfld.d 8(r24)++, f12 & Load $B$ \\ d.ml2apm.ss f10, f18, f0 & bla r27,r28, inner\_loop & \\ d.ml2apm.ss f11, f19, f0 & pfld.d 8(r24)++, f14 & Load $B$ The idea is that the row of matrix A is stored in the cache, but the row of matrix B is not. The loop is unrolled 8 times in order to get the required separation (for top speed) between the instructions which load the registers and the instructions which use them. The 8 floating point instructions require 16 input operands, and these must somehow be loaded in 7 core unit instructions. This is achieved by using quad-word loads from cache (8/4 = 2) and double-word pipeline loads (8/2 = 4). I have omitted the gory details of start-up code, and bothering to detail which load puts the results where, when. One detail that should not be omitted is the floating point instruction used here. "ml2apm.ss" advances the (3-stage) multiplier pipeline one step and advances the (3-stage) adder pipeline one step. The inputs to the multiplier are the two register operands; the inputs to the adder are the output of the multiplier and the output of the adder. In all the instructions above the output register is f0, meaning "discard". (Those of you who care about re-association of your floating point arithmetic should be a little worried by now.) Anyhow, the chip runs at full speed once the row of A is in the cache. Since only that row is cached, it can be pretty large. (2) is much more subtle. The problem is that however you unroll that loop, the elements of the column of B are not stored consecutively, and thus must be loaded one-at-a-time. For an unrolling by N, this means that N (load B) + N/4 (load A) + 1 (branch) core instructions are required for N floating point instructions. That is, if N = 8 then the chip can only operate at 8/11 (72%) of potential MFlops. However, matrix multiplication (of the non-Strassen sort) is O(N-cubed), and matrix transposition is only O(N-squared). Asymptotically, you can win that way. However however, there is a way to do it in place. Observe that so far we have been multiplying one row of A by one column of B, and we are choking on the instructions to load B. Obviously, all we need to do is reuse those loads a few times. We can do this by multiplying 3 rows of A by one column of B. Why 3? The adder has three stages, and when running as an accumulator we can treat those as registers. It is claimed that the following code will do the trick: d.ml2apm.ss f6,f16,f0 & nop & \\ d.ml2apm.ss f10,f16,f0 & nop & \\ d.ml2apm.ss f14,f16,f0 & pfld.s r20(r19)++, f18 & \\ \\ d.ml2apm.ss f7,f17,f0 & 16(r16)++, fld.q f4 & \\ d.ml2apm.ss f11,f17,f0 & 16(r17)++, fld.q f8 & \\ d.ml2apm.ss f15,f17,f0 & 16(r18)++, fld.q f12 & \\ \\ d.ml2apm.ss f4,f18,f0 & pfld.s r20(r19)++, f19 & \\ d.ml2apm.ss f8,f18,f0 & nop & \\ d.ml2apm.ss f12,f18,f0 & pfld.s r20(r19)++, f16 & \\ \\ d.ml2apm.ss f5,f19,f0 & nop & \\ d.ml2apm.ss f9,f19,f0 & bla r21, r22, inner\_loop & \\ d.ml2apm.ss f13,f19,f0 & pfld.s r20(r19)++, f17 & \\ In this case, the floating point unit is actually computing three inner products simultaneously. Doing this also means that the additions in the inner product are not reordered, which is a small win. This might help a little. Each iteration of the loop will multiply these registers, loaded from A f6 & f7 & f4 & f5 f10 & f11 & f8 & f9 f14 & f15 & f12 & f13 by these registers, loaded from B f16 f17 f18 f19 updating three quantities circulating around in the adder pipeline. At the end of the loop, the pipelines look like this: |--------Multiplier----------|-------------Adder----------------| +------------------------------+ V | f13*f19 -> f9*f19 -> f5*f19 ---> f12*f18 --> f8*f18 --> f4*f18 -+-> f0 +f15*f17 +f11*f17 +f7*f17 +f14*f16 +f10*f16 +f6*f16 +etc +etc +etc Obviously, this is not obvious, and all the edge conditions are enough to make you tear your hair. Also note that those delightful load operations prefer that their operands be aligned (at least, that's how I understood it). Anyhow, once you cross all the i's and dot all the t's, you should be able to multiply matrices in the "usual way" at full speed, i.e. 80 MFlops for a 40Mhz chip. I'll post more, later, if people are interested. Dinner and a movie are waiting. Problem (3) is not so easy, because of bandwidth constraints (more operands), but I haven't tried to rearrange the algorithms yet. David