Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!zaphod.mps.ohio-state.edu!mips!flash!rowen
From: rowen@mips.COM (Chris Rowen)
Newsgroups: comp.sys.mips
Subject: Re: MIPS assembler question
Summary: Suggested reordering of floating point instructions to exploit MIPS R3010 pipeline
Message-ID: <43786@mips.mips.COM>
Date: 4 Dec 90 18:10:36 GMT
References:
Sender: news@mips.COM
Distribution: comp
Lines: 55

Paul Hardy (ph@ama-1.ama.caltech.edu) writes:
>The main body of the multiply is a triplet of instructions: simultaneously,
>a load, add, and multiply are being performed on different registers. Since
>they're not using each others' registers, they should all execute together.
>According to the MIPS book, a single-precision floating-point multiply takes
>6 cycles, but during the last two cycles another multiply can begin, so
>effectively it takes four cycles if many multiplies occur back-to-back.
>In reality, about 7 cycles elapse between multiplies. The code looks
>something like (where A, B*, C, D, E, F are single-precision floating point
>registers, and offset is a hard-coded constant):
>
>        mul.s   A, A, B1
>        lwc1    C, offset($BASE)
>        add.s   E, E, D
>        mul.s   C, C, B2    ## 1 cycle stall if load takes 2 cycles
>        etc.
>
>Does anyone have any experience with this? Where are the extra 3 cycles going?
>How long does it _really_ take to load a value from cache? If it does take a
>lot more than 2 cycles, then I could relax make the subroutine a lot more
>flexible.

As I recall, the relevant pipelining rules of the R3010 are the following:

  1) An ADD cannot start or finish in a cycle in which a MUL starts or finishes
  2) Only one instruction can start in any cycle
  3) A load can finish in any cycle

This means that, in the order coded, the add cannot start until the
multiply has completed.

Pipelining of instructions as coded:

CYCLE    1      2      3      4      5      6      7      8      9      10     11
mul.s    START  ------ ------ RESULT
lwc1            START  RESULT
add.s                                START  RESULT
mul.s                                              START  ------ ------ RESULT
lwc1                                                      START  RESULT
add.s                                                                          START

This is six cycles per triple.

If you can reorder the code a little, it should get faster:

CYCLE    1      2      3      4      5      6      7      8      9      10     11
mul.s    START  ------ ------ RESULT
add.s           START  RESULT
lwc1                   START  RESULT
mul.s                         START  ------ ------ RESULT
add.s                                START  RESULT
lwc1                                        START  RESULT

This is three cycles per triple.

Chris Rowen
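
The three rules above are simple enough to check mechanically. What follows is only a rough sketch in Python, not a model of the real R3010. It assumes the latencies read off the diagrams (mul.s gives its result 3 cycles after it starts; add.s and lwc1 give theirs 1 cycle after), strict in-order issue, and no register dependencies or cache misses. The names LATENCY, schedule and cycles_per_triple are made up for illustration.

```python
# Cycles from START to RESULT, read off the diagrams in this post.
# Assumption: these are not taken from R3010 documentation.
LATENCY = {"mul.s": 3, "add.s": 1, "lwc1": 1}


def schedule(ops):
    """Place each op at the earliest legal cycle; return (op, start, finish)."""
    placed = []
    mul_events = set()  # cycles in which some MUL starts or finishes
    add_events = set()  # cycles in which some ADD starts or finishes
    next_start = 1      # rule 2: one start per cycle, issued in program order
    for op in ops:
        start = next_start
        while True:
            finish = start + LATENCY[op]
            # Rule 1: ADD and MUL may not share a start/finish cycle.
            if op == "add.s" and {start, finish} & mul_events:
                start += 1
            elif op == "mul.s" and {start, finish} & add_events:
                start += 1
            else:
                break  # rule 3: loads are never held back here
        if op == "mul.s":
            mul_events.update((start, finish))
        elif op == "add.s":
            add_events.update((start, finish))
        placed.append((op, start, finish))
        next_start = start + 1
    return placed


def cycles_per_triple(ops):
    """Spacing between the last two multiply starts in the schedule."""
    starts = [s for op, s, _ in schedule(ops) if op == "mul.s"]
    return starts[-1] - starts[-2]


as_coded = ["mul.s", "lwc1", "add.s"] * 3
reordered = ["mul.s", "add.s", "lwc1"] * 3

if __name__ == "__main__":
    print(cycles_per_triple(as_coded), cycles_per_triple(reordered))
```

Under these assumptions the sketch gives six cycles per triple for the original order and three for the reordered one, matching the diagrams. It says nothing about load delays or multiply-to-multiply spacing, which would have to be added before trusting it on other instruction sequences.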