Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!mailrus!cornell!uw-beaver!rice!titan!preston
From: preston@titan.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: i860 floating-point
Message-ID: <430@brazos.Rice.edu>
Date: 4 Aug 89 22:29:07 GMT
References:
Sender: root@rice.edu
Reply-To: preston@titan.rice.edu (Preston Briggs)
Distribution: usa
Organization: Rice University, Houston
Lines: 111

In article mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>The pipelined floating-point instructions are definitely not vector
>instructions, since one instruction is required for every operation.

However, vectoring compiler technology can be readily (has been, in fact)
adapted to make good code for the i860. They treat the on-chip cache
as vector registers of reasonable length, and use tight, highly pipelined
loops to operate in these "registers." The scheme is actually more
flexible than using "real" vector registers since they can do arbitrarily
complex operations, not just simple adds or multiplies.

>So it looks like a optimized fp code would be something the code below -
>All the src? and dest? are register addresses, pointing to the 16 64-bit
>floating-point registers.

Actually, there are 32x32 bit FP registers. f0 and f1 are reserved.
They can be treated as 16x64 bit if desired.

>	f_mul	src1, scr2, dest0	% begin loading pipe
>	f_mul	src3, src4, dest0	% load second stage of pipe
>	pf_mul	src5, src6, dest1	% specifies dest for 1st multiply!!!
>	pf_mul	src7, src8, dest2	% specifies dest for 2nd multiply
>	pf_mul	dum0, dum0, dest3	% begin unloading pipe
>	pf_mul	dum0, dum0, dest4	% end unloading pipe

You're close. There's actually another stage in the pipeline.
More accurate is

	pfmul.ss f2,f3,f0	start f2*f3 (single precision)
	pfmul.ss f4,f5,f0	dump garbage into f0
	pfmul.ss f6,f7,f0
	pfmul.ss f8,f9,f2	f2 gets results of f2*f3
	...

>At this point I am finished, as I have used 8 source registers and 4
>destination registers.
>Three more registers are available to store
>other scalars and such, but since loads are by 128-bit quantities, it
>is convenient to work with an even number of elements. (Note that only
>15 registers can be used, since fp0 is a fixed floating-point zero.)
>So I have done 4 floating-point operations in 6 cycles, and now I have
>to store the results to memory and grab some new sources from memory.
>I can load/store in pipelined mode, too, with two 64-bit operands
>transferred every cycle, with a 3-cycle pipeline length. So I would
>need two store instructions (4 cycles), and four load instructions (6
>cycles) to re-fill the registers that I used.
>Overall, then, it takes something like 16 cycles to deliver 4 64-bit
>floating-point operations, assuming no data or instruction cache misses.
>To get this level of performance requires loop unrolling to a depth of
>4 for this simple operation. More complicated loops may not be
>unrollable to even this depth, since only 16 registers are available.
>Lots of questions:
>(1) Am I missing something obvious?
>(2) Can more things be overlapped than this?
>(3) The programmers manual refers to instructions that use both the
>    adder and the multiplier, but most of these look like accumulate
>    functions (e.g. dot product). Is there a "linked triad" instruction
>    which takes 3 operands and does x = y + q*z, where 4 registers are
>    used?
>John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
>                   mccalpin@delocn.udel.edu

Basically, more stuff can be overlapped.

1st, there is the "dual instruction mode". This allows you to
simultaneously execute an integer ("core") instruction and a
floating-point instruction. Core instructions include loading and
storing FP registers. So, we can overlap loading and storing with
actual arithmetic. The i860 also has an auto-increment addressing mode
for zapping quickly through vectors.

2nd, there are the "dual operation instructions".
These are the ones that allow simultaneous addition and multiplication.
Unfortunately, it isn't possible to be completely general when
specifying the sources and destinations of the adder and multiplier.
However, it is possible to use both the dual instruction mode and the
dual operation instructions at the same time.

3rd and so on, you can get more use out of the registers than you have
suggested. After the 1st instruction (in my example), it would be
possible to reuse registers f2 and f3 immediately. It's also possible
to use techniques like "software pipelining" or "perfect pipelining"
to help keep the pipelines full during loops.

For matrix multiply, it looks possible to achieve a rate of
1 FLOP/cycle without using the dual-operation instructions. Using the
dual-operation instructions, it should be possible to approach
5/3 FLOP/cycle. This includes memory latency, and is perhaps achievable
using automatic techniques.

All this will be fairly difficult for a simple compiler to use.
However, by using vectoring front-ends, they can take advantage of
hand-built subroutines implementing many common operations.
In addition, a really aggressive compiler will have lots of
opportunities for optimization.

I think that a fairly complex chip like the i860 will accent the
difference between the best dependence-based optimizing compilers and
the comparatively simple compilers common today. Instead of a 30%
improvement, or even a factor of 2 improvement, over a PCC-like
compiler, we'll see a fairly sizeable integer factor.

Optimistically yours,
Preston Briggs
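
A note on the pfmul.ss sequence near the top of this post: the delayed
write-back it shows can be modeled in a few lines of Python. This is
only a sketch of the behavior, not i860 code. The function name
run_pfmul, the register dictionary, and the default depth of 3 are
assumptions made for illustration; the depth is inferred from the
example, where the fourth pfmul.ss receives the first product.

```python
from collections import deque


def run_pfmul(program, regs, depth=3):
    """Toy model of a pipelined multiply (hypothetical helper, not an i860 API).

    Each (src1, src2, dest) entry starts src1*src2 and, in the same step,
    writes to dest the product that was started `depth` instructions earlier.
    """
    # Pipeline contents before the sequence starts are unknown ("garbage").
    pipe = deque([None] * depth)
    for src1, src2, dest in program:
        # Operands are read before dest is written, so a register can be
        # both a source and the destination in later instructions.
        product = regs[src1] * regs[src2]
        leaving = pipe.popleft()
        pipe.append(product)
        # Assumption in this model: writes to the reserved registers
        # f0 and f1 are simply discarded.
        if dest not in ("f0", "f1"):
            regs[dest] = leaving
    return regs


if __name__ == "__main__":
    regs = {f"f{i}": float(i) for i in range(10)}
    regs["f0"] = 0.0
    program = [
        ("f2", "f3", "f0"),  # start f2*f3, garbage goes to f0
        ("f4", "f5", "f0"),
        ("f6", "f7", "f0"),
        ("f8", "f9", "f2"),  # f2 receives the result of f2*f3
    ]
    print(run_pfmul(program, regs)["f2"])
```

In this model f3 is untouched after the run, which matches the remark
that source registers can be reused as soon as the instruction reading
them has been issued. Three products are still in flight at the end, so
a real loop would need further pfmul instructions to drain them.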